HW/SW Co-optimization for Stencil Computation: Beginning with a Customizable Core

Yanhua Li; Youhui Zhang; Weiming Zheng

doi:10.1109/TST.2016.7590326


Open Access


Department of Computer Science, Tsinghua University, Beijing 100084, China.


Abstract

Energy efficiency is one of the most important issues in High Performance Computing (HPC) today. Heterogeneous HPC platforms that include energy-efficient customizable cores (as application-specific accelerators) are regarded as one of the promising ways to meet ever-increasing computing needs and to overcome power density limitations. In this paper, we focus on using customizable processor cores to optimize typical stencil computations, the kernel of many high-performance applications. We develop a series of effective software/hardware co-optimization strategies to exploit instruction-level and memory-computation parallelism, as well as to decrease energy consumption. These optimizations include loop tiling, prefetching, cache customization, Single Instruction Multiple Data (SIMD), and Direct Memory Access (DMA), as well as the necessary ISA extensions. Detailed power-efficiency tests are presented to evaluate the effect of all these optimizations comprehensively. The results are impressive: the combination of these optimizations improves application performance by 341% while decreasing energy consumption by 35%; a preliminary comparison with X86, GPU, and FPGA platforms also shows that the design can achieve an order of magnitude higher performance efficiency. We believe this work can help in understanding the sources of inefficiency in general-purpose chips and can serve as a starting point for customizing an energy-efficient Chip Multiprocessor (CMP) for further improvement.
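Of the optimizations listed in the abstract, loop tiling is the easiest to illustrate in isolation. The following is a minimal sketch, assuming a 2-D 5-point Jacobi stencil chosen purely for illustration; it is not the kernel evaluated in the paper, and the function names and tile sizes (`jacobi_naive`, `jacobi_tiled`, `tile_rows`, `tile_cols`) are hypothetical. It shows only that a tiled sweep visits the grid in cache-sized blocks while producing the same result as the row-by-row sweep.

```python
def make_grid(rows, cols):
    """Build a rows x cols grid with deterministic, non-trivial values."""
    return [[float((i * 31 + j * 17) % 13) for j in range(cols)]
            for i in range(rows)]


def jacobi_naive(src):
    """One 5-point Jacobi sweep over the interior, row by row."""
    rows, cols = len(src), len(src[0])
    dst = [row[:] for row in src]  # boundary values are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                + src[i][j - 1] + src[i][j + 1])
    return dst


def jacobi_tiled(src, tile_rows=8, tile_cols=8):
    """Same sweep, but the interior is visited one tile at a time.

    Tile sizes are illustrative assumptions; in practice they would be
    chosen so that a tile and its halo fit in the target cache.
    """
    rows, cols = len(src), len(src[0])
    dst = [row[:] for row in src]
    for ii in range(1, rows - 1, tile_rows):        # loop over tile origins
        i_end = min(ii + tile_rows, rows - 1)       # clip last tile
        for jj in range(1, cols - 1, tile_cols):
            j_end = min(jj + tile_cols, cols - 1)
            for i in range(ii, i_end):              # loop inside one tile
                for j in range(jj, j_end):
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                        + src[i][j - 1] + src[i][j + 1])
    return dst


if __name__ == "__main__":
    grid = make_grid(37, 53)
    assert jacobi_tiled(grid, 8, 16) == jacobi_naive(grid)
    print("tiled and naive sweeps agree")
```

Because each output point depends only on the previous grid, the tiles can be processed in any order; tiling changes the memory access pattern, not the arithmetic.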

Keywords

energy efficiency; customizable processor; stencil computation; software and hardware co-optimization

References

[1] Modeling and simulation at the exascale for energy and the environment, Report on the Advanced Scientific Computing Research Town Hall Meetings on Simulation and Modeling at the Exascale for Energy, Ecological Sustainability and Global Security (E3), http://www.sc.doe.gov/ascr/programDocuments/ProgDocs.html, 2007.

[2] Tally, New green supercomputer powers up at Purdue, http://www.purdue.edu/uns/x/2008a/080610McCartneySI-Cortex.html, 2008.

[3] Xia, Dou, Lei, and Tan, FPGA accelerator for protein secondary structure prediction based on the GOR algorithm, BMC Bioinformatics, vol. 12, no. S1, p. S5, 2011.

[4] Jiang, Mirian, Tang K. P., Chow, and Xing, Matrix multiplication based on scalable macro-pipelined FPGA accelerator architecture, in 2009 International Conference on Reconfigurable Computing and FPGAs, 2009, pp. 48-53.

[5] Liu, Neal, Chitlur, Wang, Alvin, Shen, Yu, Arthur, Ian, Joseph, et al., High-performance, energy efficient platforms using in-socket FPGA accelerators, in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009, pp. 261-264.

[6] Wehner, Oliker, and Shalf, Towards ultra-high resolution models of climate and weather, International Journal of High Performance Computing Applications, vol. 22, no. 2, pp. 149-165, 2008.

[7] Wehner, Oliker, and Shalf, Green flash: Designing an energy efficient climate supercomputer, in IEEE International Symposium on Parallel & Distributed Processing, 2009.

[8] Shaw D. E., Dror R. O., Salmon J. K., Grossman J. P., Mackenzie K. M., Bank J. A., and Chow, Millisecond-scale molecular dynamics simulations on Anton, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2009, pp. 1-11.

[9] Hameed, Qadeer, Wachs, Azizi, Solomatnikov, Lee B. C., and Horowitz, Understanding sources of inefficiency in general-purpose chips, in Proceedings of the 37th Annual International Symposium on Computer Architecture, vol. 38, no. 3, pp. 37-47, 2010.

[10] Cadence, Tensilica processors, http://ip.cadence.com/knowledgecenter/know-ten, 2016.

[11] Cong, Sarkar, Reinman, and Bui, Customizable domain-specific computing, IEEE Design and Test of Computers, vol. 28, no. 2, pp. 5-15, 2011.

[12] Fujitsu, K computer, http://www.fujitsu.com/global/about/tech/k/, 2016.

[13] Levesque, Larkin, Foster, Glenski, Geissler, Whalen, and Wasserman, Understanding and mitigating multicore performance issues on the AMD Opteron architecture, Technical Report, Lawrence Berkeley National Laboratory, 2007.

[14] Atasu, Luk, Mencer, Ozturan, and Dundar, FISH: Fast instruction synthesis for custom processors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 1, pp. 52-65, 2012.

[15] Grad and Plessl, Pruning the design space for just-in-time processor customization, in International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2010, pp. 67-72.

[16] Datta, Murphy, Volkov, Williams, Carter, Oliker, and Yelick, Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, p. 4.

[17] Membarth, Hannig, Teich, and Kostler, Towards domain-specific computing for stencil codes in HPC, in High Performance Computing, Networking, Storage and Analysis (SCC), Salt Lake City, UT, USA, 2012, pp. 1133-1138.

[18] Berger and Oliger, Adaptive mesh refinement for hyperbolic partial differential equations, Journal of Computational Physics, vol. 53, pp. 484-512, 1984.

[19] Zhang and Mueller, Auto-generation and auto-tuning of 3-D stencil codes on GPU clusters, in Proceedings of the Tenth International Symposium on Code Generation and Optimization, 2012, pp. 155-164.

[20] Kamil, Datta, Williams, Oliker, Shalf, and Yelick, Implicit and explicit optimizations for stencil computations, in ACM SIGPLAN Workshop Memory Systems Performance and Correctness, San Jose, CA, USA, 2006, pp. 51-60.

[21] Rivera and Tseng, Tiling optimizations for 3-D scientific computations, in Proceedings of ACM/IEEE 2000 Conference on Supercomputing, 2000, p. 32.

[22] Coleman and McKinley, Tile size selection using cache organization and data layout, ACM SIGPLAN Notices, vol. 30, no. 6, pp. 279-290, 1995.

[23] Bondhugula, Hartono, Ramanujam, and Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol. 43, no. 6, pp. 101-113, 2008.

[24] Phillips and Fatic, Implementing the Himeno benchmark with CUDA on GPU clusters, in 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2010, pp. 1-10.

[25] Yang, Cui, Feng, and Xue, A hybrid circular queue method for iterative stencil computations on GPUs, Journal of Computer Science and Technology, vol. 27, pp. 57-74, 2012.

[26] Araya-Polo, Cabezas, Hanzich, Pericas, Rubio, Gelado, and Cela J. M., Assessing accelerator based HPC reverse time migration, IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 147-162, 2011.

[27] Niu, Jin, Luk, Liu, and Pell, Exploiting runtime reconfiguration in stencil computation, in Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012, pp. 173-180.

Tsinghua Science and Technology

Volume 21, Issue 5, October 2016

Pages 570-580

DOI: 10.1109/TST.2016.7590326

Cite this article:

Li Y, Zhang Y, Zheng W. HW/SW Co-optimization for Stencil Computation: Beginning with a Customizable Core. Tsinghua Science and Technology, 2016, 21(5): 570-580. https://doi.org/10.1109/TST.2016.7590326

Case	Frequency (GHz)	Number of cycles ($\times 10^{9}$)	Energy (mJ)
Naïve	1.399	0.159	25.851
Tiling	1.398	0.065	14.277
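
The figures in the table above can be related to one another with a little arithmetic. The sketch below assumes that run time equals cycle count divided by clock frequency and that average power equals energy divided by run time; the dictionary keys and function names are hypothetical, and the derived values are back-of-the-envelope estimates rather than results reported by the authors.

```python
# Values copied from the naive and tiling rows of the table above.
NAIVE = {"freq_ghz": 1.399, "cycles_e9": 0.159, "energy_mj": 25.851}
TILING = {"freq_ghz": 1.398, "cycles_e9": 0.065, "energy_mj": 14.277}


def runtime_seconds(case):
    """Run time = cycles / frequency (the 1e9 factors cancel)."""
    return case["cycles_e9"] / case["freq_ghz"]


def average_power_mw(case):
    """Average power in mW = energy in mJ / run time in s."""
    return case["energy_mj"] / runtime_seconds(case)


def compare(baseline, optimized):
    """Derive speedup and relative energy saving between two rows."""
    speedup = runtime_seconds(baseline) / runtime_seconds(optimized)
    energy_saving = 1.0 - optimized["energy_mj"] / baseline["energy_mj"]
    return speedup, energy_saving


if __name__ == "__main__":
    speedup, saving = compare(NAIVE, TILING)
    print(f"speedup from tiling:  {speedup:.2f}x")
    print(f"energy saving:        {saving:.1%}")
    print(f"avg power (naive):    {average_power_mw(NAIVE):.0f} mW")
    print(f"avg power (tiling):   {average_power_mw(TILING):.0f} mW")
```

Under these assumptions the tiled case finishes in well under half the time but draws more average power while it runs, so the energy saving is smaller than the speedup alone would suggest.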

Platform	Power (W)	Chip area (mm$^{2}$)	Frequency (GHz)	Application performance (s)
Our design	0.5	0.914	1.395	33
X5560	24	66	2.8	18
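
One way to read the comparison above is in terms of energy to solution. The sketch below assumes that the "Application performance (s)" column is the run time in seconds and that the listed power is sustained for the whole run; both are assumptions made for illustration, and the helper name `energy_to_solution_joules` is hypothetical.

```python
def energy_to_solution_joules(power_watts, runtime_seconds):
    """Energy = power x time, assuming constant power over the run."""
    return power_watts * runtime_seconds


# Values copied from the table above.
ours_j = energy_to_solution_joules(0.5, 33)    # customizable core
x5560_j = energy_to_solution_joules(24, 18)    # X5560

energy_ratio = x5560_j / ours_j   # how much more energy the X5560 uses
slowdown = 33 / 18                # how much longer the custom core runs

if __name__ == "__main__":
    print(f"our design: {ours_j:.1f} J, X5560: {x5560_j:.1f} J")
    print(f"energy ratio: {energy_ratio:.1f}x, slowdown: {slowdown:.2f}x")
```

Under these assumptions the customizable core takes roughly 1.8 times as long but uses about 26 times less energy for the same run.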

Platform	Reference	Throughput (Gflop/s)	Efficiency (Mflop$\cdot$s$^{-1}\cdot$W$^{-1}$)
Our design	This work	0.8	1660.0
GPU	Datta et al.^[16]	36.0	76.5
GPU	Phillips and Fatic^[24]	51.2	n/a
GPU	Yang et al.^[25]	64.5	n/a
FPGA	Araya-Polo et al.^[26]	35.7	n/a
FPGA	Niu et al.^[27]	102.8	785.0