In this paper, we propose and implement a systematic pipeline for the automatic classification of AI-related documents extracted from large-scale literature databases, producing an AI-related literature dataset named DeepDiveAI. The dataset construction pipeline integrates expert knowledge with the capabilities of advanced models and is structured into two stages. In the first stage, expert-curated classification datasets are used to train a Long Short-Term Memory (LSTM) model, which performs coarse-grained classification of AI-related records in large-scale datasets. In the second stage, a large language model (Qwen2.5 Plus) is employed to annotate a randomly selected 10% of the records in this coarse set. The annotated records are then used to train a Bidirectional Encoder Representations from Transformers (BERT) based binary classifier, which refines the coarse set into the final DeepDiveAI dataset. Evaluation results indicate that the pipeline identifies AI-related literature from large-scale datasets both accurately and efficiently.
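The two-stage pipeline described above can be sketched in outline. The following is a minimal, hypothetical illustration: stub keyword callables stand in for the actual LSTM, Qwen2.5 Plus, and BERT models, and all function names and the toy records are assumptions for illustration, not the authors' implementation.

```python
import random


def coarse_filter(records, is_ai_coarse):
    """Stage 1: coarse-grained screening of a large corpus.

    In the paper this role is played by an LSTM trained on
    expert-curated data; here `is_ai_coarse` is any callable
    standing in for that model.
    """
    return [r for r in records if is_ai_coarse(r)]


def annotate_sample(coarse_set, annotate, fraction=0.1, seed=0):
    """Stage 2a: label a random fraction (10% in the paper) of the
    coarse set. The paper's annotator is Qwen2.5 Plus; here it is a
    stub callable returning a binary label."""
    rng = random.Random(seed)
    k = max(1, int(len(coarse_set) * fraction))
    return [(r, annotate(r)) for r in rng.sample(coarse_set, k)]


def refine(coarse_set, classifier):
    """Stage 2b: apply the trained binary classifier (BERT-based in
    the paper) to the full coarse set to produce the final dataset."""
    return [r for r in coarse_set if classifier(r)]


# Toy usage with a keyword stub in place of the trained models.
records = [
    "Deep learning for image recognition",
    "Soil chemistry of alpine meadows",
    "Transformer models for text classification",
    "Hydrology of the Nile basin",
]
is_ai = lambda r: any(w in r.lower() for w in ("learning", "transformer", "neural"))

coarse = coarse_filter(records, is_ai)       # Stage 1 output
labeled = annotate_sample(coarse, is_ai)     # training data for the refiner
final = refine(coarse, is_ai)                # final DeepDiveAI-style dataset
```

In the real pipeline, `labeled` would be used to fine-tune the BERT classifier before the `refine` step; the stub collapses training and inference into one callable purely to show the data flow.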
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).