Survey

AI Computing Systems for Large Language Models Training

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Cambricon Technologies, Beijing 100191, China
Shanghai Innovation Center for Processor Technologies, Shanghai 201210, China
Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 101408, China

Abstract

In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for training large language models (LLMs). The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial demand for computational resources. These operations depend heavily on AI-specific accelerators such as GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical roles of distributed computing strategies, memory management enhancements, and improvements in computational efficiency. This paper clarifies the complex relationship among algorithm design, hardware infrastructure, and software optimization, provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, and offers insights into the challenges and potential avenues for future development and deployment.
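To make the scale of these computational and memory requirements concrete, the short sketch below estimates the model-state memory needed to train an LLM with mixed precision and the Adam optimizer, and shows how ZeRO-style sharding across a distributed system reduces the per-device footprint. This is a minimal illustration rather than a method taken from the paper: the 16-bytes-per-parameter accounting (fp16 weights and gradients plus fp32 master weights and Adam moments) follows the widely cited ZeRO analysis, and the model sizes and device count used here are hypothetical.

```python
# Illustrative sketch (not from the paper): model-state memory for
# mixed-precision Adam training, per the widely cited ZeRO accounting.
# Model sizes and the 64-device cluster below are hypothetical examples.

def model_state_bytes(num_params: int) -> int:
    """Memory for model states on one replica, excluding activations."""
    fp16_weights = 2 * num_params      # fp16 copy used in forward/backward
    fp16_grads = 2 * num_params        # fp16 gradients
    fp32_optimizer = 12 * num_params   # fp32 master weights + Adam moments
    return fp16_weights + fp16_grads + fp32_optimizer

def per_device_bytes(num_params: int, num_devices: int) -> float:
    """ZeRO-style full sharding: model states split evenly across devices."""
    return model_state_bytes(num_params) / num_devices

GB = 1024 ** 3
for billions in (7, 70, 175):
    n = billions * 10 ** 9
    total = model_state_bytes(n) / GB
    sharded = per_device_bytes(n, 64) / GB
    print(f"{billions:>4}B params: {total:8.1f} GB unsharded, "
          f"{sharded:6.1f} GB/device across 64 devices")
```

Note that this estimate excludes activation memory, which grows with batch size and sequence length; activations are a major target of the memory management techniques surveyed in the paper, such as recomputation and offloading.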

Electronic Supplementary Material

Video: JCST-2402-14178-Video.mp4
Highlights: JCST-2402-14178-Highlights.pdf (180 KB)



Cite this article:
Zhang Z-X, Wen Y-B, Lyu H-Q, et al. AI Computing Systems for Large Language Models Training. Journal of Computer Science and Technology, 2025, 40(1): 6-41. https://doi.org/10.1007/s11390-024-4178-1

Views: 1846
Crossref: 12
Web of Science: 7
Scopus: 7
CSCD: 0

Received: 08 February 2024
Accepted: 05 January 2025
Published: 23 February 2025
© Institute of Computing Technology, Chinese Academy of Sciences 2025