Simultaneous Accelerator Parallelization and Point-to-Point Interconnect Insertion for Bus-Based Embedded SoCs

Daming Zhang; Yongpan Liu; Shuangchen Li; Tongda Wu; Huazhong Yang

doi:10.1109/TST.2015.7350017

Open Access

Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

Department of Electronic and Computer Engineering, University of California, Santa Barbara, CA 93106, USA.

Abstract

As performance requirements for bus-based embedded Systems-on-Chip (SoCs) increase, a growing number of on-chip application-specific hardware accelerators (e.g., filters, FFTs, JPEG encoders, GSMs, and AES encoders) are being integrated into their designs. These accelerators require system-level tradeoffs among performance, area, and scalability. Accelerator parallelization and Point-to-Point (P2P) interconnect insertion are two effective system-level adjustments. The former boosts computing performance at the cost of area, while the latter provides higher bandwidth at the cost of routability. Moreover, the two adjustments interact with each other. This paper proposes a design flow that optimizes accelerator parallelization and P2P interconnect insertion simultaneously. To explore the huge optimization space, we develop an effective algorithm whose goal is to reduce total SoC latency under constraints on SoC area and total P2P wire length. Experimental results show that the performance difference between the proposed algorithm and the optimal results is only 2.33% on average, while the running time of the algorithm is less than 17 s.

Keywords

accelerator parallelization; point-to-point interconnect insertion; bus-based embedded system-on-chips

Tsinghua Science and Technology

Volume 20, Issue 6, December 2015

Pages 644-660

Cite this article:

Zhang D, Liu Y, Li S, et al. Simultaneous Accelerator Parallelization and Point-to-Point Interconnect Insertion for Bus-Based Embedded SoCs. Tsinghua Science and Technology, 2015, 20(6): 644-660. https://doi.org/10.1109/TST.2015.7350017

Type	Parameter	Description
ACC related	${in}_{i}$ / ${exe}_{i}$ / ${out}_{i}$	Input/execution/output delay per execution
ACC related	$n_{i}^{m}$	Number of executions of ${ACC}_{i}$ in $G^{m}$
ACC related	$a_{i}$ / ${ma}_{i}$	Area of ${ACC}_{i}$ / area of its local memory
ACC related	$d_{i, j}^{m}$	Data volume from ${ACC}_{i}$ to ${ACC}_{j}$ in $G^{m}$
SoC related	pa / mma	Area of the processor / main memory
SoC related	bt / pt	Transmission delay through the bus / a P2P channel
SoC related	bset / ba	Setting delay / area of the bus
SoC related	Schedule	Schedule of application subgraphs
Opt. related	$p_{i}$	Parallel degree of ${ACC}_{i}$
Opt. related	$l_{i, j}$	P2P interconnection from ${ACC}_{i}$ to ${ACC}_{j}$
Opt. related	$A_{const}$ / $A_{sum}$	Area constraint / optimized area of the SoC
Opt. related	$L_{const}$ / $L_{sum}$	Constraint on / optimized value of the P2P wire length
Opt. related	$T_{sum}$	Execution delay of the SoC
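
Read together with the abstract, the parameters above suggest a compact statement of the optimization problem. The following is a sketch rather than the paper's own formulation: the objective and the two constraints come from the abstract, while the variable domains (integer $p_{i} \ge 1$, binary $l_{i, j}$) are assumptions inferred from reported settings such as $p_{1} = 2$ and $l_{1, 2} = 1$.

```latex
\begin{aligned}
\min_{\{p_{i}\},\,\{l_{i,j}\}} \quad & T_{sum} \\
\text{s.t.} \quad & A_{sum} \le A_{const}, \\
                  & L_{sum} \le L_{const}, \\
                  & p_{i} \in \{1, 2, \dots\}, \quad l_{i,j} \in \{0, 1\}
                    \quad \text{(assumed domains)}.
\end{aligned}
```

Under these assumed domains the number of candidate configurations grows multiplicatively with each accelerator's range of parallel degrees and doubles with each candidate P2P link, which is consistent with the abstract's description of the optimization space as huge.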

Benchmark	Optimized variables	Model delay (cycles)	RTL delay (cycles)	Model area (mm$^{2}$)	RTL area (mm$^{2}$)
ECG	Non-optimized	$1.69 \times 10^{3}$	$1.65 \times 10^{3}$	0.871	0.892
ECG	$p_{1} = 2, l_{1, 2} = 1$	$1.53 \times 10^{3}$	$1.49 \times 10^{3}$	0.907	0.943
Media	Non-optimized	$4.48 \times 10^{6}$	$4.39 \times 10^{6}$	1.41	1.48
Media	$p_{1} = 2, l_{2, 3} = 1$	$3.59 \times 10^{6}$	$3.51 \times 10^{6}$	1.47	1.55
R2	Non-optimized	$2.15 \times 10^{4}$	$2.09 \times 10^{4}$	1.13	1.16
R2	$p_{2} = 3, l_{1, 4} = 1$	$1.77 \times 10^{4}$	$1.73 \times 10^{4}$	1.27	1.32
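
To make the model-versus-RTL comparison in this table easier to read, the short script below computes the relative deviation of each model estimate from its RTL counterpart. It is only an illustrative calculation on the values copied from the table; the names `rows` and `rel_dev` are hypothetical helpers, not part of the paper.

```python
# (model, RTL) pairs copied from the table above.
# Delay is in cycles, area in mm^2.
rows = {
    ("ECG", "non-optimized"): {"delay": (1.69e3, 1.65e3), "area": (0.871, 0.892)},
    ("ECG", "p1=2, l12=1"): {"delay": (1.53e3, 1.49e3), "area": (0.907, 0.943)},
    ("Media", "non-optimized"): {"delay": (4.48e6, 4.39e6), "area": (1.41, 1.48)},
    ("Media", "p1=2, l23=1"): {"delay": (3.59e6, 3.51e6), "area": (1.47, 1.55)},
    ("R2", "non-optimized"): {"delay": (2.15e4, 2.09e4), "area": (1.13, 1.16)},
    ("R2", "p2=3, l14=1"): {"delay": (1.77e4, 1.73e4), "area": (1.27, 1.32)},
}


def rel_dev(model, rtl):
    """Signed deviation of the model estimate from the RTL value, in percent."""
    return 100.0 * (model - rtl) / rtl


for (bench, config), vals in rows.items():
    delay_dev = rel_dev(*vals["delay"])
    area_dev = rel_dev(*vals["area"])
    print(f"{bench:6s} {config:14s} delay {delay_dev:+.2f}%  area {area_dev:+.2f}%")
```

In every row the model delay is slightly higher than the RTL delay and the model area is slightly lower than the RTL area; the deviations stay below 3% for delay and below 6% for area.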

Area constraint (%)	P2P constraint (%)	Running time, Greedy (s)	Running time, SAA (s)	Running time, Proposed (s)	Running time, Traversal (s)
17	33	0.133	19.2	12.7	103
26	50	0.142	19.2	12.8	785
35	50	0.156	19.2	12.8	$4.49 \times 10^{3}$
44	67	0.173	19.3	12.9	$3.76 \times 10^{4}$
53	83	0.192	19.2	13.0	Failed

Benchmark	Running time, Greedy (s)	Running time, SAA (s)	Running time, Proposed (s)	Running time, Traversal (s)
R1	0.156	19.2	12.8	$4.49 \times 10^{3}$
R2	0.287	21.9	14.7	$6.64 \times 10^{4}$
R3	0.418	23.9	16.3	Failed
ECG	0.201	21.4	14.1	$4.73 \times 10^{4}$
Media	0.372	22.8	15.8	Failed
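
The running times in this last table can be turned into speedup ratios with a few lines of arithmetic. The sketch below does so for the per-benchmark values; the dictionary `times` and the function `speedup` are hypothetical names introduced for illustration, and a `None` entry stands for a run reported as "Failed".

```python
# Running times in seconds, copied from the per-benchmark table above.
# None marks a Traversal run that the table reports as "Failed".
times = {
    "R1": {"Greedy": 0.156, "SAA": 19.2, "Proposed": 12.8, "Traversal": 4.49e3},
    "R2": {"Greedy": 0.287, "SAA": 21.9, "Proposed": 14.7, "Traversal": 6.64e4},
    "R3": {"Greedy": 0.418, "SAA": 23.9, "Proposed": 16.3, "Traversal": None},
    "ECG": {"Greedy": 0.201, "SAA": 21.4, "Proposed": 14.1, "Traversal": 4.73e4},
    "Media": {"Greedy": 0.372, "SAA": 22.8, "Proposed": 15.8, "Traversal": None},
}


def speedup(bench, baseline, method="Proposed"):
    """Return baseline_time / method_time, or None if the baseline run failed."""
    base = times[bench][baseline]
    if base is None:
        return None
    return base / times[bench][method]


for bench in times:
    vs_traversal = speedup(bench, "Traversal")
    vs_saa = speedup(bench, "SAA")
    traversal_text = "n/a (failed)" if vs_traversal is None else f"{vs_traversal:.0f}x"
    print(f"{bench:6s} vs Traversal: {traversal_text:>12s}   vs SAA: {vs_saa:.2f}x")
```

The ratios cover running time only; solution quality relative to the optimum is reported separately in the abstract.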