Open Access

A Novel Parallel Processing Element Architecture for Accelerating ODE and AI

Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield S1 3JD, United Kingdom

Abstract

Transforming complex problems into simpler computational tasks, such as recasting ordinary differential equations (ODEs) in matrix form, is key to AI advancement and paves the way for more efficient computing architectures. Systolic arrays, known for their computational efficiency, low power consumption, and ease of implementation, address AI's computational challenges and are central to mainstream industry AI accelerators; improvements to the Processing Element (PE) significantly boost systolic array performance and streamline the overall architecture. This research presents a novel PE design, and its integration into a systolic array, based on a novel computing theory: bit-level mathematics for the Multiply-Accumulate (MAC) operation. We present three different PE architectures and provide a comprehensive comparison between them and state-of-the-art technologies, focusing on power, area, and throughput. We also demonstrate the integration of the proposed MAC unit with systolic arrays, highlighting significant improvements in computational efficiency. Compared to the state-of-the-art design, our implementation achieves 2380952.38 times lower latency while using 64.19 times fewer DSP48E1 slices, 1.26 times fewer Look-Up Tables (LUTs), and 10.76 times fewer Flip-Flops (FFs), with 99.63 times lower power consumption and 15.19 times higher performance per PE.
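The abstract's core idea, recasting an ODE as a matrix computation and then evaluating that computation on a grid of MAC-based Processing Elements, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's bit-level MAC design (which the abstract does not reproduce): it models an output-stationary array in plain Python/NumPy, with one accumulator per PE, and applies it to a forward-Euler step x_{k+1} = (I + hA)x_k.

```python
import numpy as np

def euler_step_matrix(A, h):
    """Rewrite the ODE x' = A x as one matrix: forward Euler gives
    x_{k+1} = (I + h*A) x_k, so each time step is a matrix-vector multiply."""
    return np.eye(A.shape[0]) + h * A

def systolic_matmul(A, B):
    """Software stand-in for an output-stationary systolic array computing C = A @ B.
    Each (i, j) pair plays the role of one Processing Element (PE) holding a single
    accumulator; the innermost statement is the MAC a PE would perform per cycle."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):            # PE row
        for j in range(m):        # PE column
            acc = 0.0             # the PE's local accumulator
            for t in range(k):    # one operand pair streams in per cycle
                acc += A[i, t] * B[t, j]   # the MAC operation
            C[i, j] = acc
    return C

# Demo: integrate x' = A x (a harmonic oscillator) with Euler steps,
# where every step is carried out by the modeled MAC array.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
M = euler_step_matrix(A, h=0.01)
x = np.array([[1.0],
              [0.0]])
for _ in range(100):              # integrate to t = 1.0
    x = systolic_matmul(M, x)
print(x.ravel())                  # approx. (cos 1, -sin 1) = (0.5403, -0.8415)
```

In hardware, each inner-loop MAC corresponds to one PE cycle; the paper's contribution lies in how that MAC is realized at the bit level, which this high-level sketch deliberately does not model.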


Tsinghua Science and Technology
Pages 1954–1964
Cite this article:
Yang K, Liu L, Liu H, et al. A Novel Parallel Processing Element Architecture for Accelerating ODE and AI. Tsinghua Science and Technology, 2025, 30(5): 1954-1964. https://doi.org/10.26599/TST.2024.9010090


Received: 30 March 2024
Revised: 02 May 2024
Accepted: 09 May 2024
Published: 29 April 2025
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
