Since OpenAI opened access to ChatGPT, large language models (LLMs) have become an increasingly popular topic, attracting researchers' attention across many domains. However, researchers in the public domain face difficulties when developing LLMs, because most LLMs are produced by industry and their training details are typically not disclosed. Since datasets are a key component of LLM development, this paper presents a holistic survey of the training datasets used in both the pre-training and fine-tuning processes. The paper first summarizes 16 pre-training datasets and 16 fine-tuning datasets used in state-of-the-art LLMs. Then, based on the properties of the pre-training and fine-tuning processes, it examines pre-training datasets in terms of quality, quantity, and their relation to models, and fine-tuning datasets in terms of quality, quantity, and open concerns. The study then critically identifies the problems and research trends in current LLM datasets. It helps researchers in the public domain train and investigate LLMs through visual cases and offers the research community useful observations on data development. To the best of our knowledge, this paper is the first to summarize and discuss the datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to LLM research by pointing out the existing problems of LLM studies from the perspective of data.
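As a rough illustration of the distinction the survey draws between pre-training and fine-tuning data, the following minimal Python sketch shows the typical shape of records in each kind of dataset. It is an assumption for exposition only: the field names ("text", "source", "instruction", "response") are common conventions, not the specific schemas of the datasets surveyed in the paper.

```python
# Illustrative sketch (not from the paper): typical record shapes for
# pre-training corpora vs. fine-tuning (instruction) datasets.
from dataclasses import dataclass


@dataclass
class PretrainRecord:
    # Pre-training corpora are usually raw, unlabeled text collected at scale
    # (web pages, books, code), often with only a source tag as metadata.
    text: str
    source: str  # e.g., "web", "books", "code" (hypothetical labels)


@dataclass
class FinetuneRecord:
    # Fine-tuning (instruction) datasets pair a prompt with a desired response,
    # so quality and consistency matter more than raw volume.
    instruction: str
    response: str


if __name__ == "__main__":
    pretrain_example = PretrainRecord(
        text="The Transformer architecture relies on self-attention ...",
        source="web",
    )
    finetune_example = FinetuneRecord(
        instruction="Summarize the Transformer architecture in one sentence.",
        response="It models sequences with self-attention instead of recurrence.",
    )
    print(pretrain_example)
    print(finetune_example)
```

This contrast underlies why the survey evaluates the two kinds of datasets with different criteria: scale and coverage dominate for pre-training corpora, while label quality and task design dominate for fine-tuning data.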