Volume 26, Issue 6

News Keyword Extraction Algorithm Based on Semantic Clustering and Word Graph Model

Ao Xiong, Derong Liu, Hongkang Tian, Zhengyuan Liu, Peng Yu, Michel Kadoch
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
École de Technologie Supérieure, Université du Québec, Montreal, H3C 1K3, Canada

Abstract

The internet delivers an abundance of news every day, so efficient algorithms for extracting keywords from text are important for obtaining information quickly. However, the precision and recall of established keyword extraction algorithms still need improvement. TextRank, which is derived from the PageRank algorithm, propagates word weights over a word graph, but this propagation considers only word frequency. To improve its performance, we propose Semantic Clustering TextRank (SCTR), a news keyword extraction algorithm that adds semantic clustering to TextRank. First, word vectors generated by the Bidirectional Encoder Representations from Transformers (BERT) model are clustered with k-means to capture semantic groupings. Then, the clustering results are used to construct the TextRank weight transfer probability matrix. Finally, the word graph is computed iteratively and keywords are extracted. The experiments are conducted on a Chinese news corpus. The results show that SCTR achieves higher precision, recall, and F1 score than the traditional TextRank and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms.
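
The three steps named in the abstract (cluster word vectors, build a transfer probability matrix from the clusters, iterate over the word graph) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the clustering enters the matrix, so the sketch assumes one simple rule: a co-occurrence edge is multiplied by a hypothetical `boost` factor when both words fall in the same cluster. The 2-D vectors are toy stand-ins for BERT embeddings, and all function names are illustrative.

```python
import math
import random
from collections import defaultdict


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over {word: vector}; returns {word: cluster_id}."""
    words = sorted(vectors)
    rng = random.Random(seed)
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    assign = {}
    for _ in range(iters):
        for w in words:  # assignment step: nearest centroid
            assign[w] = min(range(k), key=lambda c: math.dist(vectors[w], centroids[c]))
        for c in range(k):  # update step: mean of the cluster members
            members = [vectors[w] for w in words if assign[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cooccurrence_graph(tokens, window=2):
    """Undirected word graph weighted by co-occurrence counts within a window."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:
                graph[u][v] += 1.0
                graph[v][u] += 1.0
    return graph


def transition_matrix(graph, clusters, boost=2.0):
    """Row-normalised transfer probabilities.

    ASSUMPTION: same-cluster edges are multiplied by `boost`; the source
    does not specify the actual weighting rule.
    """
    trans = {}
    for u, nbrs in graph.items():
        weighted = {
            v: cnt * (boost if clusters[u] == clusters[v] else 1.0)
            for v, cnt in nbrs.items()
        }
        total = sum(weighted.values())
        trans[u] = {v: w / total for v, w in weighted.items()}
    return trans


def rank(trans, d=0.85, iters=50):
    """TextRank-style iteration: S(v) = (1 - d) + d * sum_u P(u -> v) * S(u)."""
    score = {w: 1.0 for w in trans}
    for _ in range(iters):
        score = {
            v: (1 - d) + d * sum(trans[u].get(v, 0.0) * score[u] for u in trans)
            for v in trans
        }
    return score


# Toy input: hypothetical 2-D vectors standing in for BERT word embeddings.
tokens = ["bank", "stock", "market", "rain", "stock", "bank", "market", "storm"]
vectors = {
    "bank": (1.0, 0.1), "stock": (0.9, 0.2), "market": (1.0, 0.0),
    "rain": (0.0, 1.0), "storm": (0.1, 0.9),
}

clusters = kmeans(vectors, k=2)
trans = transition_matrix(cooccurrence_graph(tokens), clusters)
scores = rank(trans)
keywords = sorted(scores, key=scores.get, reverse=True)[:3]
print(keywords)
```

Because each row of the matrix sums to one, the iteration redistributes weight among words without changing the total score. In practice the toy vectors would be replaced by embeddings from a pretrained BERT model, and the number of clusters and the weighting rule would need tuning on the target corpus.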

Keywords: keyword extraction, TextRank, semantics, word vector


Publication history

Received: 09 September 2020
Accepted: 09 October 2020
Published: 09 June 2021
Issue date: December 2021

Copyright

© The author(s) 2021.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2018YFE0205502) and the National Natural Science Foundation of China (No. 61672108).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
