Data Preparation for Large Language Models

Hao Liang; Zhen Hao Wong; Rui-Tong Liu; Yu-Han Wang; Mei-Yi Qiang; Zheng-Yang Zhao; Cheng-Yu Shen; Cong-Hui He; Wen-Tao Zhang; Bin Cui

doi:10.1007/s11390-026-5948-8

Survey

Hao Liang¹,², Zhen Hao Wong³, Rui-Tong Liu³, Yu-Han Wang³, Mei-Yi Qiang⁴, Zheng-Yang Zhao¹, Cheng-Yu Shen⁴, Cong-Hui He⁵, Wen-Tao Zhang¹,², Bin Cui⁶,⁷

¹ Center for Data Science, Peking University, Beijing 100871, China

² Beijing Zhongguancun Academy, Beijing 100871, China

³ School of Mathematical Science, Peking University, Beijing 100871, China

⁴ School of Software and Microelectronics, Peking University, Beijing 100871, China

⁵ Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China

⁶ School of Computer Science, Peking University, Beijing 100871, China

⁷ Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Beijing 100871, China

Abstract

Large language models (LLMs) have demonstrated remarkable generalization capabilities across diverse domains, largely attributed to the availability of massive amounts of high-quality training data. Recently, the development paradigm of LLMs has been shifting from a model-centric to a data-centric perspective. In this paper, we provide a comprehensive survey of data preparation algorithms and workflows for LLMs, categorized into three stages: pre-training, continual pre-training, and post-training. We further summarize widely used datasets along with their associated data preparation methods, offering a practical reference for researchers who may lack extensive experience in data preparation. Finally, we outline potential directions for future work, highlighting open challenges and opportunities in advancing data preparation for LLMs.

Keywords

data-centric artificial intelligence (AI); data management; large language model (LLM)

Electronic Supplementary Material

JCST-2509-15948-Highlights.pdf (2.5 MB)

Journal of Computer Science and Technology

Volume 41, Issue 1, April 2026

Pages 289-317

DOI: 10.1007/s11390-026-5948-8

Cite this article:

Liang H, Wong ZH, Liu R-T, et al. Data Preparation for Large Language Models. Journal of Computer Science and Technology, 2026, 41(1): 289-317. https://doi.org/10.1007/s11390-026-5948-8

Received: 15 September 2025

Accepted: 11 January 2026

Published: 30 April 2026