Transforming complex problems, such as converting ordinary differential equations (ODEs) into matrix form, into simpler computational tasks is key to advances in AI and paves the way for more efficient computing architectures. Systolic arrays, known for their computational efficiency, low power consumption, and ease of implementation, address AI's computational challenges and are central to mainstream industry AI accelerators; improving the Processing Element (PE) therefore significantly boosts systolic array performance and streamlines the overall computing architecture. This research presents a novel PE design and its integration into a systolic array, based on a new computing approach: bit-level mathematics for the Multiply-Accumulate (MAC) operation. We present three different PE architectures and provide a comprehensive comparison between them and state-of-the-art designs, focusing on power, area, and throughput. We also demonstrate the integration of the proposed MAC unit into systolic arrays, highlighting significant improvements in computational efficiency. Compared with the state-of-the-art design, our implementation achieves 2380952.38 times lower latency while using 64.19 times fewer DSP48E1 slices, 1.26 times fewer Look-Up Tables (LUTs), and 10.76 times fewer Flip-Flops (FFs), with 99.63 times lower power consumption and 15.19 times higher performance per PE.
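The abstract does not specify the internals of the proposed bit-level PE, so the following is only a minimal C sketch of the conventional baseline it improves upon: an output-stationary systolic array in which each PE performs one MAC operation (acc += a * b) per cycle on skewed operand streams. The names (pe_mac, systolic_matmul, N) and the skew scheme are illustrative assumptions, not the authors' design.

/*
 * Illustrative simulation of an output-stationary systolic array of
 * MAC processing elements (PEs).  This is NOT the paper's bit-level
 * PE design; it only models the conventional MAC behaviour each PE
 * implements while operands are skewed through the array.
 */
#include <stdio.h>

#define N 3  /* array (and matrix) dimension, chosen for illustration */

/* One MAC step of a single PE: accumulate the product of its operands. */
static int pe_mac(int acc, int a, int b) {
    return acc + a * b;
}

/* Output-stationary systolic matmul: C = A * B.
 * On global cycle t, PE (i, j) receives A[i][k] and B[k][j] with
 * k = t - i - j, mimicking the diagonal skew of a real array. */
static void systolic_matmul(const int A[N][N], const int B[N][N], int C[N][N]) {
    int acc[N][N] = {0};
    for (int t = 0; t < 3 * N - 2; t++) {           /* global cycles */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int k = t - i - j;                  /* skewed operand index */
                if (k >= 0 && k < N)
                    acc[i][j] = pe_mac(acc[i][j], A[i][k], B[k][j]);
            }
        }
    }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = acc[i][j];
}

int main(void) {
    int A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    int B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    int C[N][N];
    systolic_matmul(A, B, C);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%4d ", C[i][j]);
        printf("\n");
    }
    return 0;
}

In this baseline, each output element C[i][j] stays resident in its PE while partial products accumulate; the paper's contribution is a redesigned PE for exactly this MAC step.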
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).