News Keyword Extraction Algorithm Based on Semantic Clustering and Word Graph Model

Ao Xiong; Derong Liu; Hongkang Tian; Zhengyuan Liu; Peng Yu; Michel Kadoch

doi:10.26599/TST.2020.9010051

Tsinghua Science and Technology 2021, 26(6): 886-893 https://doi.org/10.26599/TST.2020.9010051

Open Access | Issue | Published: 09 June 2021

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

École de Technologie Supérieure, Université du Québec, Montreal, H3C 1K3, Canada

Keywords: keyword extraction, TextRank, semantics, word vector

Cite this article:

Xiong A, Liu D, Tian H, et al. News Keyword Extraction Algorithm Based on Semantic Clustering and Word Graph Model. Tsinghua Science and Technology, 2021, 26(6): 886-893. https://doi.org/10.26599/TST.2020.9010051

Abstract

The internet delivers a large volume of news every day, so efficient algorithms for extracting keywords from text are important for obtaining information quickly. However, the precision and recall of established keyword extraction algorithms still need improvement. TextRank, which is derived from the PageRank algorithm, uses a word graph to propagate weight among words, but its weight propagation considers only word frequency. To improve its performance, we propose Semantic Clustering TextRank (SCTR), a news keyword extraction algorithm that adds semantic clustering to TextRank. First, word vectors generated by the Bidirectional Encoder Representations from Transformers (BERT) model are grouped by k-means clustering to capture semantic clusters. Then, the clustering results are used to construct the TextRank weight transfer probability matrix. Finally, the word graph is computed iteratively and keywords are extracted. The experiments are conducted on a Chinese news corpus, and the results show that SCTR achieves higher precision, recall, and F1 value than the traditional TextRank and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms.
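The three steps named in the abstract (cluster word vectors, build a transition matrix from the clusters, iterate over the word graph) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the clustering results enter the matrix, so the rule used here (multiply a co-occurrence weight by `same_cluster_boost` when both words share a cluster) is an assumption. The toy 2-D vectors stand in for BERT embeddings, and all function names, the window size, and the damping factor are hypothetical choices.

```python
import math
from collections import defaultdict

def kmeans(vectors, k, iters=20):
    """Tiny deterministic k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

def cooccurrence(tokens, window=2):
    """Symmetric co-occurrence counts within a sliding window."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:i + window + 1]:
            if u != w:
                counts[(w, u)] += 1.0
                counts[(u, w)] += 1.0
    return counts

def transition_matrix(words, counts, labels, same_cluster_boost=2.0):
    """Row-stochastic matrix. ASSUMPTION: edges between words in the same
    cluster are multiplied by same_cluster_boost before normalization."""
    n = len(words)
    matrix = []
    for i, wi in enumerate(words):
        row = []
        for j, wj in enumerate(words):
            weight = counts.get((wi, wj), 0.0)
            if weight and labels[i] == labels[j]:
                weight *= same_cluster_boost
            row.append(weight)
        total = sum(row)
        # Words without neighbours spread their weight uniformly.
        matrix.append([w / total for w in row] if total else [1.0 / n] * n)
    return matrix

def rank(matrix, damping=0.85, iters=50):
    """PageRank-style power iteration over the transition matrix."""
    n = len(matrix)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * matrix[i][j] for i in range(n))
                  for j in range(n)]
    return scores

# Toy input: hypothetical tokens and 2-D vectors standing in for BERT output.
tokens = ["market", "stock", "price", "market", "rain",
          "stock", "price", "storm", "rain"]
toy_vectors = {"market": (1.0, 0.1), "stock": (0.95, 0.2), "price": (0.8, 0.1),
               "rain": (0.1, 0.9), "storm": (0.2, 1.0)}

words = sorted(set(tokens))
labels = kmeans([toy_vectors[w] for w in words], k=2)
matrix = transition_matrix(words, cooccurrence(tokens), labels)
scores = rank(matrix)
for word, score in sorted(zip(words, scores), key=lambda p: -p[1]):
    print(f"{word}: {score:.3f}")
```

Setting `same_cluster_boost=1.0` reduces the sketch to a plain co-occurrence-weighted TextRank, which gives a simple baseline to compare the semantic weighting against.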

Publication history

Received: 09 September 2020

Accepted: 09 October 2020

Published: 09 June 2021

Issue date: December 2021

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2018YFE0205502) and the National Natural Science Foundation of China (No. 61672108).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).