Journal Home > Volume 21 , Issue 5

HW/SW Co-optimization for Stencil Computation: Beginning with a Customizable Core

Yanhua Li, Youhui Zhang, and Weiming Zheng
Department of Computer Science, Tsinghua University, Beijing 100084, China.

Abstract

Energy efficiency is one of the most important issues for High Performance Computing (HPC) today. A heterogeneous HPC platform with energy-efficient customizable cores (serving as application-specific accelerators) is believed to be one of the promising solutions for meeting ever-increasing computing needs and overcoming power density limitations. In this paper, we focus on using customizable processor cores to optimize typical stencil computations, the kernel of many high-performance applications. We develop a series of effective software/hardware co-optimization strategies to exploit instruction-level and memory-computation parallelism and to decrease energy consumption. These optimizations include loop tiling, prefetching, cache customization, Single Instruction Multiple Data (SIMD), and Direct Memory Access (DMA), as well as the necessary Instruction Set Architecture (ISA) extensions. Detailed power-efficiency tests are given to evaluate the effect of all these optimizations comprehensively. The results are impressive: the combination of these optimizations improved application performance by 341% while decreasing energy consumption by 35%; a preliminary comparison with x86, GPU, and FPGA platforms also showed that the design could achieve an order of magnitude higher performance efficiency. We believe this work can help in understanding the sources of inefficiency in general-purpose chips and can serve as a starting point for customizing an energy-efficient Chip Multiprocessor (CMP) for further improvement.

Keywords: energy efficiency, customizable processor, stencil computation, software and hardware co-optimization
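
To make one of the optimizations named in the abstract more concrete, the following is a minimal sketch of loop tiling applied to a stencil sweep. It is not the authors' implementation: the 2-D 5-point Jacobi stencil, the coefficient 0.2, the tile size, and the names `stencil_naive`, `stencil_tiled`, and `make_grid` are all illustrative assumptions. Pure Python is used only to show how the loop nest is restructured; it says nothing about the performance or energy figures reported above.

```python
def stencil_naive(src, n):
    """One 5-point Jacobi sweep over the interior of an n x n grid.

    The grid is a flat, row-major list. Boundary cells are copied unchanged.
    The stencil shape and the 0.2 weight are illustrative assumptions.
    """
    dst = list(src)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            c = i * n + j
            dst[c] = 0.2 * (src[c] + src[c - 1] + src[c + 1]
                            + src[c - n] + src[c + n])
    return dst


def stencil_tiled(src, n, tile=4):
    """The same sweep, visiting the interior one tile x tile block at a time.

    Each block touches a small working set, which is the property loop
    tiling relies on to improve cache reuse on real hardware.
    """
    dst = list(src)
    for ii in range(1, n - 1, tile):          # first interior row of the block
        i_end = min(ii + tile, n - 1)         # clip the block at the boundary
        for jj in range(1, n - 1, tile):      # first interior column of the block
            j_end = min(jj + tile, n - 1)
            for i in range(ii, i_end):
                for j in range(jj, j_end):
                    c = i * n + j
                    dst[c] = 0.2 * (src[c] + src[c - 1] + src[c + 1]
                                    + src[c - n] + src[c + n])
    return dst


def make_grid(n):
    """Deterministic test grid (hypothetical helper, values are arbitrary)."""
    return [float((i * 7 + j * 3) % 11) for i in range(n) for j in range(n)]
```

Because both versions read only from `src` and write only to `dst`, the tiled traversal computes exactly the same values as the naive one; only the order in which cells are visited changes. A call such as `stencil_tiled(make_grid(10), 10, tile=4)` can be compared against `stencil_naive(make_grid(10), 10)` to confirm this.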


Publication history

Received: 07 September 2015
Accepted: 03 February 2016
Published: 18 October 2016
Issue date: October 2016

Copyright

© The author(s) 2016

Acknowledgements

The work was supported by the National High-Tech Research and Development (863) Program of China (No. 2013AA01A215) and the Brain Inspired Computing Research of Tsinghua University (No. 20141080934).
