Open Access

DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset

Xingzhou Liang1,2, Xiaochen Zhou3, Hui Zou4, Yi Lu5, and Jingjing Qu1(✉)
1 Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
2 School of International and Public Affairs, Shanghai Jiao Tong University, Shanghai 200030, China
3 University of Hong Kong, Hong Kong 999077, China
4 School of Cultural Heritage and Information Management, Shanghai University, Shanghai 200030, China
5 Department of Informatics, King's College London, London WC2R 2LS, UK

Abstract

In this paper, we propose and implement a systematic pipeline for the automatic classification of AI-related documents extracted from large-scale literature databases, resulting in an AI-related literature dataset named DeepDiveAI. The construction pipeline integrates expert knowledge with the capabilities of advanced models and is structured into two primary stages. In the first stage, expert-curated classification datasets are used to train a Long Short-Term Memory (LSTM) model, which performs coarse-grained classification of AI-related records from large-scale datasets. In the second stage, a large language model, Qwen2.5 Plus, annotates a random 10% sample of the coarse-grained AI-related records; these annotated records are then used to train a Bidirectional Encoder Representations from Transformers (BERT) based binary classifier, which refines the coarse set into the final DeepDiveAI dataset. Evaluation results indicate that the proposed pipeline identifies AI-related literature from large-scale datasets both accurately and efficiently.
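To make the two-stage design concrete, the following is a minimal Python sketch of the pipeline's structure. The model sizes, the bert-base-uncased checkpoint, and the llm_annotate helper are illustrative assumptions rather than the authors' reported configuration; in practice the annotation step would wrap an API call to Qwen2.5 Plus with a classification prompt.

```python
# Minimal sketch of the two-stage DeepDiveAI pipeline described in the
# abstract. Hyperparameters and helper names are illustrative assumptions.
import random
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ---- Stage 1: LSTM coarse classifier trained on expert-curated labels ----
class CoarseLSTMClassifier(nn.Module):
    """Binary screen: does a record look AI-related at all?"""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)  # logits: [not AI, AI]

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq, embed)
        _, (last_hidden, _) = self.lstm(embedded)  # final hidden state
        return self.head(last_hidden[-1])          # (batch, 2)

# ---- Stage 2: LLM-annotated sample used to fine-tune a BERT refiner ----
def build_refinement_set(coarse_records, llm_annotate):
    """Label a random 10% of the coarse set with an LLM such as Qwen2.5 Plus.

    `llm_annotate` is a hypothetical callable returning 1 (AI-related) or 0;
    in practice it would wrap an API call with a classification prompt.
    """
    sample = random.sample(coarse_records, k=max(1, len(coarse_records) // 10))
    return [(text, llm_annotate(text)) for text in sample]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
refiner = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def refine_step(texts, labels, optimizer):
    """One gradient step of BERT fine-tuning on LLM-annotated records."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = refiner(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Example usage (hypothetical data):
# optimizer = torch.optim.AdamW(refiner.parameters(), lr=2e-5)
# loss = refine_step(["Deep learning for protein folding ..."], [1], optimizer)
```

The split mirrors the paper's rationale: a lightweight LSTM screens the full corpus cheaply, while the BERT refiner, trained on a small LLM-annotated sample, corrects the coarse decisions.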

Journal of Social Computing
Pages 158-169
Cite this article:
Liang X, Zhou X, Zou H, et al. DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset. Journal of Social Computing, 2025, 6(2): 158-169. https://doi.org/10.23919/JSC.2025.0007

Received: 19 September 2024
Revised: 22 April 2025
Accepted: 18 May 2025
Published: 30 June 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
