Volume 5, Issue 2


News Topic Detection Based on Capsule Semantic Graph

Shuang Yang, Yan Tang
College of Computer and Information Science, Southwest University, Chongqing 400000, China

Abstract

Most news topic detection methods are word-based; they tend to ignore relationships among words and suffer from semantic sparsity, which lowers topic detection accuracy. In addition, the current mainstream probabilistic and graph analysis methods for topic detection have high time complexity. To address these problems, we present a news topic detection model based on the capsule semantic graph (CSG). Keywords that co-occur in the same text are modeled as a keyword graph, which is divided into multiple subgraphs through community detection. Each subgraph contains a group of closely related keywords and serves as a vertex of the CSG. The semantic relationship among the vertices is obtained by calculating the similarity of the average word vectors of the vertices. The news texts are clustered using an incremental clustering method in which each text is represented through the CSG and the similarity among texts is calculated with a graph kernel. The relationship between vertices and edges is also considered when calculating the similarity. Experimental results on three standard datasets show that CSG achieves higher precision, recall, and F1 values than several recent methods. Experimental results on large-scale news datasets indicate that the time complexity of CSG is lower than that of probabilistic methods and other graph analysis methods.

Keywords: news topic detection, capsule semantic graph, graph kernel
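
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy two-dimensional word vectors, the threshold of 0.8, and the choice of a cluster's first member as its representative are all assumptions made for illustration. The community detection step is omitted, and cosine similarity over averaged word vectors stands in for the graph kernel that the paper uses to compare texts.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def keyword_graph(docs):
    """Count how often each pair of keywords co-occurs in the same text."""
    edges = Counter()
    for keywords in docs:
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges


def mean_vector(words, vectors):
    """Average the word vectors of the words that have an embedding."""
    known = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def incremental_clustering(items, similarity, threshold):
    """Single pass: join the most similar cluster, or open a new one."""
    clusters = []  # each cluster is a list of item indices
    for i, item in enumerate(items):
        best, best_sim = None, threshold
        for cluster in clusters:
            # Assumption: the first member represents the whole cluster.
            sim = similarity(item, items[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Toy data (hypothetical keywords and embeddings).
docs = [
    ["earthquake", "rescue", "japan"],
    ["earthquake", "japan", "tsunami"],
    ["election", "vote", "senate"],
]
vectors = {
    "earthquake": [1.0, 0.0], "rescue": [0.9, 0.1], "japan": [0.8, 0.2],
    "tsunami": [1.0, 0.1], "election": [0.0, 1.0], "vote": [0.1, 0.9],
    "senate": [0.0, 1.0],
}

graph = keyword_graph(docs)
doc_vectors = [mean_vector(d, vectors) for d in docs]
clusters = incremental_clustering(doc_vectors, cosine, threshold=0.8)
print(clusters)  # [[0, 1], [2]]
```

Passing the similarity function as a parameter keeps the clustering loop independent of how texts are compared, so a graph kernel over CSGs could be substituted for `cosine` without changing `incremental_clustering`.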

Publication history

Received: 05 November 2021
Accepted: 18 November 2021
Published: 25 January 2022
Issue date: June 2022

Copyright

© The author(s) 2022.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
