Journal Home > Volume 22 , Issue 2

Modeling Chinese Microblogs with Five Ws for Topic Hashtags Extraction

Zhibin Zhao, Jiahong Sun, Lan Yao, Xun Wang, Jiahong Chu, Huan Liu, Ge Yu
College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China.

Abstract

Hashtags are important metadata in microblogs, where they mark topics and index messages. However, statistics show that most microblogs carry no hashtags, which poses great challenges for retrieving and analyzing these tagless microblogs. In this paper, we summarize the similarities between microblogs and short-message-style news, and then propose an algorithm, named 5WTAG, for detecting microblog topics based on a model of five Ws (When, Where, Who, What, hoW). Because the five-W attributes are the core components of an event description, 5WTAG is theoretically guaranteed to extract semantic topics from microblogs properly. We describe the procedure of the algorithm in detail, including spam microblog identification, microblog segmentation, and candidate hashtag construction. In addition, we propose a novel recommendation computing method for ranking candidate hashtags, which combines syntactic and semantic analysis and observes the distribution of artificial topic hashtags. Finally, using real data from Sina Weibo, we conduct comprehensive experiments to verify the semantic correctness and completeness of the candidate hashtags, as well as the accuracy of the recommendation method.

Keywords: hashtag, microblog, topic detection, short-message-style news, five Ws
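
To make the idea of five-W candidate construction concrete, the following is a minimal toy sketch and not the 5WTAG implementation, whose details are not given on this page. It assumes the tokens of a microblog have already been segmented and labeled with a five-W role. The names `Token`, `five_w_slots`, `candidate_hashtags`, `rank`, and `render`, the rule of combining the first filler of each role, and the per-role weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical role labels for the five Ws; the paper's actual tag set is not shown here.
ROLES = ("when", "where", "who", "what", "how")


@dataclass(frozen=True)
class Token:
    text: str
    role: str  # one of ROLES, or "other" for tokens outside the five-W model


def five_w_slots(tokens):
    """Group token texts by five-W role, keeping first-seen order and dropping repeats."""
    slots = {role: [] for role in ROLES}
    for tok in tokens:
        if tok.role in slots and tok.text not in slots[tok.role]:
            slots[tok.role].append(tok.text)
    return slots


def candidate_hashtags(tokens, max_parts=2):
    """Build candidates by combining the first filler of 1..max_parts distinct roles."""
    slots = five_w_slots(tokens)
    filled = [(role, words[0]) for role, words in slots.items() if words]
    candidates = []
    for n in range(1, max_parts + 1):
        candidates.extend(combinations(filled, n))
    return candidates


def rank(candidates, weights):
    """Sort candidates by summed per-role weight (an illustrative score only)."""
    def score(cand):
        return sum(weights.get(role, 0.0) for role, _ in cand)
    return sorted(candidates, key=score, reverse=True)


def render(cand):
    """Format a candidate with a '#' on both sides of the topic text."""
    return "#" + " ".join(word for _, word in cand) + "#"


# Toy input: role labels are assigned by hand for this example.
tokens = [
    Token("yesterday", "when"),
    Token("Shenyang", "where"),
    Token("university", "who"),
    Token("opened", "other"),
    Token("library", "what"),
]
# Assumed weights that favor Who and What; these are not values from the paper.
weights = {"who": 3.0, "what": 3.0, "where": 2.0, "when": 1.0}

ranked = rank(candidate_hashtags(tokens), weights)
best = render(ranked[0])
```

In this sketch the ranking is a plain weighted sum, standing in for the recommendation computing method described in the abstract, which combines syntactic and semantic analysis and is not reproduced here.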

Publication history

Received: 31 August 2016
Revised: 24 December 2016
Accepted: 26 December 2016
Published: 06 April 2017
Issue date: April 2017

Copyright

© The author(s) 2017

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61173027) and the Northeastern University Fundamental Research Funds for the Central Universities (Nos. N150404012 and N140404006).
