Design and Tool Flow of a Reconfigurable Asynchronous Neural Network Accelerator

Jilin Zhang; Hui Wu; Weijia Chen; Shaojun Wei; Hong Chen

doi:10.26599/TST.2020.9010048


Open Access


Institute of Microelectronics, Tsinghua National Laboratory for Information Science and Technology, and Beijing Engineering Center of Technology and research on Wireless Medical and Health System, Tsinghua University, Beijing 100084, China


Abstract

Convolutional Neural Networks (CNNs) are widely used in computer vision, natural language processing, and other fields, where real applications generally demand low power and high efficiency. Energy efficiency has therefore become a critical metric for CNN accelerators. Because asynchronous circuits offer low power consumption, high speed, and freedom from clock distribution problems, we design and implement an energy-efficient asynchronous CNN accelerator in a 65 nm Complementary Metal Oxide Semiconductor (CMOS) process. Since no commercial design tool flow exists for asynchronous circuits, we develop a novel design flow that takes Click-based asynchronous bundled-data circuits through to mask layout using conventional Electronic Design Automation (EDA) tools. We also introduce an adaptive delay matching method and perform accurate static timing analysis to ensure correct timing. An accelerator for a handwriting recognition network (the LeNet-5 model) is implemented. Silicon test results show that the asynchronous accelerator consumes 30% less power in the computing array than its synchronous counterpart and achieves an energy efficiency of 1.538 TOPS/W, 12% higher than that of the synchronous chip.
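
The timing requirement that motivates delay matching in a bundled-data circuit can be illustrated with a small sketch. The Python below is a generic illustration and not the adaptive method the paper introduces: it assumes a hypothetical delay-cell library in which every cell adds the same fixed delay, and it picks the smallest number of cells so that the request signal arrives after the worst-case data-path delay plus a setup time, with a safety margin. All function names and numeric values are invented for illustration.

```python
import math


def delay_cells_needed(data_delay_ns, setup_ns, cell_delay_ns, margin=0.1):
    """Return the smallest number of delay cells whose total delay covers
    the worst-case data-path delay plus setup time, scaled by a margin.

    Hypothetical model: every delay cell contributes the same fixed delay.
    """
    if cell_delay_ns <= 0:
        raise ValueError("cell_delay_ns must be positive")
    required_ns = (data_delay_ns + setup_ns) * (1.0 + margin)
    return math.ceil(required_ns / cell_delay_ns)


def request_is_safe(num_cells, data_delay_ns, setup_ns, cell_delay_ns):
    """Bundled-data constraint: the request must arrive after the data is stable."""
    return num_cells * cell_delay_ns >= data_delay_ns + setup_ns


# Illustrative numbers only (not taken from the paper).
cells = delay_cells_needed(data_delay_ns=4.0, setup_ns=0.2, cell_delay_ns=0.5)
```

In a real flow the data-path delay would come from static timing analysis rather than a hand-entered constant, and the margin would have to cover process, voltage, and temperature variation.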

Keywords

Convolutional Neural Network (CNN) accelerator; asynchronous circuit; energy efficiency; adaptive delay matching; asynchronous design flow


Tsinghua Science and Technology, Volume 26, Issue 5, October 2021, Pages 565-573

DOI: 10.26599/TST.2020.9010048

Cite this article:

Zhang J, Wu H, Chen W, et al. Design and Tool Flow of a Reconfigurable Asynchronous Neural Network Accelerator. Tsinghua Science and Technology, 2021, 26(5): 565-573. https://doi.org/10.26599/TST.2020.9010048

Table 1

Performance	Throughput of continuous input	Delay of intermittent input
Synchronous pipeline	1/D1	2 $\times$ D1
Asynchronous pipeline	1/D1	D1 + D2
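
The pipeline comparison above can be read as a simple two-stage timing model. The sketch below assumes that D1 and D2 are the delays of two pipeline stages with D1 the slower one, that the synchronous clock period is set by the slowest stage, and that each asynchronous stage finishes in its own delay. These interpretations are assumptions, since the table itself does not define the symbols.

```python
def synchronous_pipeline(d1, d2):
    """Two-stage clocked pipeline: the clock period is the slowest stage delay."""
    period = max(d1, d2)
    return {"throughput": 1.0 / period, "latency": 2 * period}


def asynchronous_pipeline(d1, d2):
    """Two-stage self-timed pipeline: each stage takes only its own delay."""
    return {"throughput": 1.0 / max(d1, d2), "latency": d1 + d2}


# Illustrative delays in nanoseconds (hypothetical values), with D1 >= D2.
sync = synchronous_pipeline(10.0, 4.0)
asyn = asynchronous_pipeline(10.0, 4.0)
```

Under this model both pipelines sustain the same throughput on a continuous input stream, while an isolated input passes through the asynchronous pipeline sooner whenever D2 is smaller than D1.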

Table 2

Metric	Asynchronous	Synchronous
Area (mm$^{2}$)	1.96 $\times$ 1.96	1.93 $\times$ 1.93
Click element gates	30177	0
Clock tree buffers	1076	1270
Performance (GOPS)	60.9 @ 150 MHz	48.4 @ 120 MHz
Static power (mW)	8.0	7.76
Standby power (mW)	24.0 @ (0.8 V, 100 MHz)	28.8 @ (0.8 V, 100 MHz)
Power in working mode (mW)	26.4 @ (0.8 V, 100 MHz)	29.6 @ (0.8 V, 100 MHz)
Energy efficiency (TOPS/W)	1.538	1.37
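
As a rough consistency check on the asynchronous column, the reported energy efficiency can be reproduced from the throughput and working-mode power. The sketch below assumes that throughput scales linearly with clock frequency, so that the 150 MHz figure can be rescaled to the 100 MHz operating point of the power measurement. That scaling is an assumption and is not stated in the table; the function names are illustrative.

```python
def scaled_gops(gops, measured_mhz, target_mhz):
    """Scale throughput linearly with clock frequency (an assumption)."""
    return gops * target_mhz / measured_mhz


def tops_per_watt(gops, power_mw):
    """GOPS divided by mW equals TOPS/W (1e9 / 1e-3 = 1e12)."""
    return gops / power_mw


# Values from the asynchronous column of the table.
async_gops_100mhz = scaled_gops(60.9, measured_mhz=150, target_mhz=100)
async_efficiency = tops_per_watt(async_gops_100mhz, power_mw=26.4)

# Relative gain computed from the two reported efficiencies.
relative_gain = (1.538 - 1.37) / 1.37
```

With these inputs `async_efficiency` rounds to the reported 1.538 TOPS/W, and `relative_gain` is roughly 0.12, in line with the 12% improvement stated in the abstract.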