Volume 26, Issue 3

More Bang for Your Buck: Boosting Performance with Capped Power Consumption

Juan Chen, Xinxin Qi, Feihao Wu, Jianbin Fang, Yong Dong, Yuan Yuan, Zheng Wang, Keqin Li
College of Computer, National University of Defense Technology, Changsha 410073, China.
College of Computer, University of Leeds, Leeds LS2 9JT, UK.
School of Science and Engineering, State University of New York, New Paltz, NY 12561, USA.

Abstract

Improving the performance of computing systems without increasing their power and energy consumption is an outstanding challenge. This paper develops a novel resource allocation scheme for memory-bound applications running on High-Performance Computing (HPC) clusters, aiming to improve application performance without breaching peak power constraints or increasing total energy consumption. Our scheme estimates how the number of processor cores and the CPU frequency setting affect application performance. It then uses this estimate to allocate additional compute nodes to memory-bound applications when doing so is profitable. We implement our algorithm, apply it to 12 representative benchmarks from the NAS Parallel Benchmarks and HPC Challenge (HPCC) benchmark suites, and evaluate it on a representative HPC cluster. Experimental results show that our approach effectively mitigates memory contention to improve application performance, and does so without significantly increasing peak power or overall energy consumption. On average, our approach achieves a 12.69% performance improvement over the default resource allocation strategy while using 7.06% less total power, which translates into 17.77% energy savings.
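
As a rough sanity check on how the three reported averages relate, recall that energy is average power multiplied by execution time. The sketch below is not taken from the paper: the function and variable names are hypothetical, and it assumes the 12.69% figure can be read either as a speedup or as a runtime reduction (the abstract does not say which). It also treats the averages as if they combined multiplicatively, which holds only approximately for per-benchmark averages.

```python
def energy_savings(power_reduction, time_ratio):
    """Fractional energy saved when average power drops by `power_reduction`
    and runtime is scaled by `time_ratio`, using E = P * T."""
    return 1.0 - (1.0 - power_reduction) * time_ratio


POWER_REDUCTION = 0.0706  # 7.06% less total power (from the abstract)
PERF_GAIN = 0.1269        # 12.69% performance improvement (from the abstract)

# Reading A (assumption): the gain is a speedup, so T_new = T_old / 1.1269.
savings_speedup = energy_savings(POWER_REDUCTION, 1.0 / (1.0 + PERF_GAIN))

# Reading B (assumption): the gain is a runtime reduction, so T_new = 0.8731 * T_old.
savings_runtime = energy_savings(POWER_REDUCTION, 1.0 - PERF_GAIN)

print(f"speedup reading: {savings_speedup:.2%}")
print(f"runtime reading: {savings_runtime:.2%}")
```

Under these assumptions the two readings give energy savings of roughly 17.5% and 18.9%, which bracket the reported 17.77%. The residual gap is expected, since the paper reports averages over 12 benchmarks rather than a single run.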

Keywords: energy efficiency, high-performance computing, performance boost, power control, processor frequency scaling

Publication history

Received: 25 March 2020
Accepted: 02 April 2020
Published: 12 October 2020
Issue date: June 2021

Copyright

© The author(s) 2021.

Acknowledgements

This work was supported in part by the Advanced Research Project of China (No. 31511010203) and the Research Program of NUDT (No. ZK18-03-10).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
