Survey

Data Preparation for Large Language Models

Center for Data Science, Peking University, Beijing 100871, China
Beijing Zhongguancun Academy, Beijing 100871, China
School of Mathematical Science, Peking University, Beijing 100871, China
School of Software and Microelectronics, Peking University, Beijing 100871, China
Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
School of Computer Science, Peking University, Beijing 100871, China
Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Beijing 100871, China

Abstract

Large language models (LLMs) have demonstrated remarkable generalization capabilities across diverse domains, a success largely attributed to the availability of massive amounts of high-quality training data. Recently, the development paradigm of LLMs has been shifting from a model-centric to a data-centric perspective. In this paper, we provide a comprehensive survey of data preparation algorithms and workflows for LLMs, organized into three stages: pre-training, continual pre-training, and post-training. We further summarize widely used datasets along with their associated data preparation methods, offering a practical reference for researchers who may lack extensive experience in data preparation. Finally, we outline potential directions for future work, highlighting open challenges and opportunities in advancing data preparation for LLMs.

Electronic Supplementary Material

JCST-2509-15948-Highlights.pdf (2.5 MB)

Journal of Computer Science and Technology
Pages 289-317

Cite this article:
Liang H, Wong ZH, Liu R-T, et al. Data Preparation for Large Language Models. Journal of Computer Science and Technology, 2026, 41(1): 289-317. https://doi.org/10.1007/s11390-026-5948-8


Received: 15 September 2025
Accepted: 11 January 2026
Published: 30 April 2026
© Institute of Computing Technology, Chinese Academy of Sciences 2026