Volume 20, Issue 6

Simultaneous Accelerator Parallelization and Point-to-Point Interconnect Insertion for Bus-Based Embedded SoCs

Daming Zhang, Yongpan Liu, Shuangchen Li, Tongda Wu, Huazhong Yang
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.
Department of Electronic and Computer Engineering, University of California, Santa Barbara, CA 93106, USA.

Abstract

As performance requirements for bus-based embedded System-on-Chips (SoCs) increase, a growing number of on-chip application-specific hardware accelerators (e.g., filters, FFTs, JPEG encoders, GSMs, and AES encoders) are being integrated into their designs. These accelerators require system-level tradeoffs among performance, area, and scalability. Accelerator parallelization and Point-to-Point (P2P) interconnect insertion are two effective system-level adjustments. The former boosts computing performance at the cost of area, while the latter provides higher bandwidth at the cost of routability. Moreover, the two adjustments interact with each other. This paper proposes a design flow that optimizes accelerator parallelization and P2P interconnect insertion simultaneously. To explore the large optimization space, we develop an effective algorithm whose goal is to reduce total SoC latency under constraints on SoC area and total P2P wire length. Experimental results show that the performance difference between our proposed algorithm and the optimal results is only 2.33% on average, while the running time of the algorithm is less than 17 s.

Keywords: accelerator parallelization, point-to-point interconnect insertion, bus-based embedded system-on-chips
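
To make the optimization problem in the abstract concrete, the following is a minimal Python sketch of a joint search over parallelization degrees and P2P link choices, minimizing total latency under an area budget and a P2P wire-length budget. It is not the algorithm proposed in the paper: the search is a plain exhaustive enumeration, and the accelerator figures, the candidate links, the latency model, the constant `P2P_TRANSFER_FACTOR`, and the assumption that a link's wire length grows with the parallelization degree of its source are all hypothetical, chosen only to illustrate how the two adjustments can interact.

```python
from itertools import product

# Hypothetical toy data; none of these numbers come from the paper.
ACCELS = {
    "fft":  {"compute": 80.0,  "area": 3.0, "bus_transfer": 20.0},
    "jpeg": {"compute": 120.0, "area": 5.0, "bus_transfer": 30.0},
    "aes":  {"compute": 40.0,  "area": 2.0, "bus_transfer": 10.0},
}
# Candidate P2P links (source, destination) -> base wire length (assumed).
P2P_LINKS = {("fft", "jpeg"): 4.0, ("jpeg", "aes"): 6.0}
# Assumed: a P2P link cuts the source's transfer latency to 25% of the bus value.
P2P_TRANSFER_FACTOR = 0.25


def total_latency(degrees, links):
    """Sum of per-accelerator compute latency (divided by degree) and transfer latency."""
    latency = 0.0
    for name, spec in ACCELS.items():
        transfer = spec["bus_transfer"]
        if any(src == name for src, _ in links):
            transfer *= P2P_TRANSFER_FACTOR
        latency += spec["compute"] / degrees[name] + transfer
    return latency


def total_area(degrees):
    """Area grows linearly with the number of parallel copies (assumed)."""
    return sum(ACCELS[name]["area"] * k for name, k in degrees.items())


def total_wire(degrees, links):
    """Assumed interaction: each parallel copy of the source needs its own wires."""
    return sum(P2P_LINKS[link] * degrees[link[0]] for link in links)


def exhaustive_search(max_degree, area_budget, wire_budget):
    """Return (latency, degrees, links) of the best feasible design, or None."""
    names = list(ACCELS)
    link_keys = list(P2P_LINKS)
    best = None
    for ks in product(range(1, max_degree + 1), repeat=len(names)):
        degrees = dict(zip(names, ks))
        if total_area(degrees) > area_budget:
            continue
        for mask in product((False, True), repeat=len(link_keys)):
            links = [link for link, on in zip(link_keys, mask) if on]
            if total_wire(degrees, links) > wire_budget:
                continue
            latency = total_latency(degrees, links)
            if best is None or latency < best[0]:
                best = (latency, degrees, links)
    return best


if __name__ == "__main__":
    print(exhaustive_search(max_degree=4, area_budget=20.0, wire_budget=10.0))
```

Exhaustive enumeration is only practical for a handful of accelerators and candidate links, since the number of designs grows exponentially with both. That growth is the reason a dedicated algorithm is needed for realistic design spaces.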


Publication history

Received: 08 November 2015
Accepted: 16 November 2015
Published: 17 December 2015
Issue date: December 2015

Copyright

© The author(s) 2015

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61271269), the National High-Tech Research and Development (863) Program (No. 2013AA01320), and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. YETP0102).
