Volume 22, Issue 6




Measuring Similarity of Academic Articles with Semantic Profile and Joint Word Embedding

Ming Liu, Bo Lang, Zepeng Gu, and Ahmed Zeeshan
State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China.

Abstract

Measuring the semantic similarity of long documents is significant for many applications, such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts, and document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transitions. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements, such as the research purpose, methodology, and domain, are included and enriched. We can thus obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured with word vectors. To improve the semantic representation quality of the word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that our approach substantially improves on state-of-the-art methods for measuring document semantic similarity, and that our joint word-embedding model produces significantly better word representations than traditional word-embedding models.

Keywords: word embedding, document semantic similarity, text understanding, semantic enrichment, scientific literature analysis
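The profile-matching idea described in the abstract can be sketched as follows: represent each paper by a small set of profile concepts, embed each concept as a word vector, and aggregate concept-level cosine similarities into a document-level score. The vectors, vocabulary, and aggregation rule below are illustrative assumptions for a minimal sketch, not the paper's actual model:

```python
from math import sqrt

# Toy 3-d word vectors standing in for embeddings trained with the joint
# word-embedding model; all words and values here are illustrative assumptions.
VECTORS = {
    "retrieval": (0.9, 0.1, 0.0),
    "search":    (0.8, 0.2, 0.1),
    "ranking":   (0.7, 0.3, 0.2),
    "parsing":   (0.1, 0.9, 0.3),
    "grammar":   (0.0, 0.8, 0.4),
}

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def profile_similarity(profile_a, profile_b):
    """Score two semantic profiles by matching each concept in profile_a
    to its closest concept in profile_b and averaging the matches (one
    simple aggregation; the paper's exact profile distance may differ)."""
    best = [max(cosine(VECTORS[a], VECTORS[b]) for b in profile_b)
            for a in profile_a]
    return sum(best) / len(best)

# Hypothetical profiles: concepts extracted for purpose/method/domain.
paper_ir_1 = ["retrieval", "ranking"]
paper_ir_2 = ["search", "retrieval"]
paper_nlp  = ["parsing", "grammar"]

print(profile_similarity(paper_ir_1, paper_ir_2))  # high (same domain)
print(profile_similarity(paper_ir_1, paper_nlp))   # lower (different domain)
```

With better embeddings, concepts that share a domain-specific relation (e.g., a method and the task it addresses) land closer together, which is exactly what the joint model's semantic relation constraint is meant to encourage.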

References (38)

[1]
Tenenbaum J. B., Kemp C., Griffiths T. L., and Goodman N. D., How to grow a mind: Statistics, structure, and abstraction, Science, vol. 331, no. 6022, pp. 1279-1285, 2011.
[2]
Pan J. Y., Cheng C. P. J., Lau G. T., and Law K. H., Utilizing statistical semantic similarity techniques for ontology mapping—with applications to AEC standard models, Tsinghua Sci. Technol., vol. 13, no. S1, pp. 217-222, 2008.
[3]
Leacock C. and Chodorow M., Combining local context and WordNet similarity for word sense identification, in WordNet: An Electronic Lexical Database, Fellbaum C., ed. The MIT Press, 1998.
[4]
Mikolov T., Chen K., Corrado G., and Dean J., Efficient estimation of word representations in vector space, arXiv preprint arXiv: 1301.3781, 2013.
[5]
Resnik P., Using information content to evaluate semantic similarity in a taxonomy, in Proc. 14th Int. Joint Conf. Artificial Intelligence, Montreal, Canada, 1995.
[6]
Rus V., Lintean M. C., Graesser A., and McNamara D., Assessing student paraphrases using lexical semantics and word weighting, in Proc. 14th Int. Conf. Artificial Intelligence in Education, Brighton, UK, 2009.
[7]
Corley C. and Mihalcea R., Measuring the semantic similarity of texts, in Proc. ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI, USA, 2005, pp. 13-18.
[8]
Xu Z., Luo X. F., Zhang S. X., Wei X., Mei L., and Hu C. P., Mining temporal explicit and implicit semantic relations between entities using web search engines, Future Generat. Comput. Syst., vol. 37, pp. 468-477, 2014.
[9]
Xu Z., Luo X. F., Yu J., and Xu W. M., Measuring semantic similarity between words by removing noise and redundancy in web snippets, Concurr. Comput. Pract. Exp., vol. 23, no. 18, pp. 2496-2510, 2011.
[10]
Xu Z., Luo X. F., Mei L., and Hu C. P., Measuring the semantic discrimination capability of association relations, Concurr. Comput. Pract. Exp., vol. 26, no. 2, pp. 380-395, 2014.
[11]
Agirre E., Banea C., Cardie C., Cer D., Diab M., Gonzalez-Agirre A., Guo W. W., Lopez-Gazpio I., Maritxalar M., Mihalcea R., et al., SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability, in Proc. 9th Int. Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 2015.
[12]
Šarić F., Glavaš G., Karan M., Šnajder J., and Bašić B. D., TakeLab: Systems for measuring semantic text similarity, in Proc. 6th Int. Workshop on Semantic Evaluation, Montréal, Canada, 2012, pp. 441-448.
[13]
Bär D., Biemann C., Gurevych I., and Zesch T., UKP: Computing semantic textual similarity by combining multiple content similarity measures, in Proc. 1st Joint Conf. Lexical and Computational Semantics, Montréal, Canada, 2012.
[14]
Han W., Zhu X., Zhu Z., Chen W., Zheng W., and Lu J., A comparative analysis on weibo and twitter, Tsinghua Sci. Technol., vol. 21, no. 1, pp. 1-16, 2016.
[15]
Zhang M. Y., Qin B., Liu T., and Zheng M., Triple based background knowledge ranking for document enrichment, in Proc. COLING 2014, the 25th Int. Conf. Computational Linguistics: Technical Papers, Dublin, Ireland, 2014.
[16]
Schuhmacher M. and Ponzetto S. P., Knowledge-based graph document modeling, in Proc. 7th ACM Int. Conf. Web Search and Data Mining, New York, NY, USA, 2014, pp. 543-552.
[17]
Ramage D., Rafferty A. N., and Manning C. D., Random walks for text semantic similarity, in Proc. 2009 Workshop on Graph-Based Methods for Natural Language Processing, Suntec, Singapore, 2009, pp. 23-31.
[18]
Zhang M. Y., Qin B., Zheng M., Hirst G., and Liu T., Encoding distributional semantics into triple-based knowledge ranking for document enrichment, in Proc. 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int. Joint Conf. Natural Language Processing, Beijing, China, 2015.
[19]
Salton G., Wong A., and Yang C. S., A vector space model for automatic indexing, Commun. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[20]
Miller G. A., WordNet: A lexical database for English, Commun. ACM, vol. 38, no. 11, pp. 39-41, 1995.
[21]
Bollacker K., Evans C., Paritosh P., Sturge T., and Taylor J., Freebase: A collaboratively created graph database for structuring human knowledge, in Proc. 2008 ACM SIGMOD Int. Conf. Management of Data, Vancouver, Canada, 2008, pp. 1247-1250.
[22]
Landauer T. K., Foltz P. W., and Laham D., An introduction to latent semantic analysis, Discourse Process., vol. 25, nos. 2&3, pp. 259-284, 1998.
[23]
Wang D. Q., Zhang H., Liu R., Liu X. L., and Wang J., Unsupervised feature selection through Gram-Schmidt orthogonalization—A word co-occurrence perspective, Neurocomputing, vol. 173, pp. 845-854, 2016.
[24]
Blei D. M., Ng A. Y., and Jordan M. I., Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[25]
Bengio Y., Schwenk H., Senécal J. S., Morin F., and Gauvain J. L., Neural probabilistic language models, in Innovations in Machine Learning, Holmes D. E. and Jain L. C., eds. Springer, 2006, pp. 137-186.
[26]
Le Q. V. and Mikolov T., Distributed representations of sentences and documents, arXiv preprint arXiv: 1405.4053, 2014.
[27]
Auer S., Bizer C., Kobilarov G., Lehmann J., Cyganiak R., and Ives Z., DBpedia: A nucleus for a web of open data, in The Semantic Web, Aberer K., Choi K. S., Noy N., Allemang D., Lee K. I., Nixon L., Golbeck J., Mika P., Maynard D., Mizoguchi R., et al., eds. Springer, 2007.
[28]
Pennington J., Socher R., and Manning C. D., GloVe: Global vectors for word representation, in Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
[29]
Gabrilovich E. and Markovitch S., Computing semantic relatedness using Wikipedia-based explicit semantic analysis, in Proc. 20th Int. Joint Conf. Artificial Intelligence, Hyderabad, India, 2007, pp. 1606-1611.
[30]
Rafi M. and Shaikh M. S., An improved semantic similarity measure for document clustering based on topic maps, arXiv preprint arXiv: 1303.4087, 2013.
[31]
Rus V., Niraula N., and Banjade R., Similarity measures based on latent Dirichlet allocation, in Computational Linguistics and Intelligent Text Processing, Gelbukh A., ed. Springer, 2013, pp. 459-470.
[32]
Rus V., Lintean M., Banjade R., Niraula N., and Stefanescu D., SEMILAR: The semantic similarity toolkit, in Proc. 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 163-168.
[33]
Wu Z. B. and Palmer M., Verb semantics and lexical selection, in Proc. 32nd Annual Meeting on Association for Computational Linguistics, Las Cruces, NM, USA, 1994, pp. 133-138.
[34]
Fried D. and Duh K., Incorporating both distributional and relational semantics in word representations, arXiv preprint arXiv: 1412.4369, 2014.
[35]
Yu M. and Dredze M., Improving lexical embeddings with semantic knowledge, in Proc. 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Baltimore, MD, USA, 2014, pp. 545-550.
[36]
Radev D. R., Muthukrishnan P., and Qazvinian V., The ACL anthology network corpus, in Proc. 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, Suntec, Singapore, 2009, pp. 54-61.
[37]
Dolan B., Quirk C., and Brockett C., Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources, in Proc. 20th Int. Conf. Computational Linguistics, Geneva, Switzerland, 2004, p. 350.
[38]
Rus V., Lintean M., Moldovan C., Baggett W., Niraula N., and Morgan B., The SIMILAR corpus: A resource to foster the qualitative understanding of semantic similarity of texts, in Proc. 8th Language Resources and Evaluation Conf., Istanbul, Turkey, 2012, pp. 23-25.

Publication history

Received: 31 December 2016
Revised: 22 April 2017
Accepted: 14 June 2017
Published: 14 December 2017
Issue date: December 2017

Copyright

© The author(s) 2017

Acknowledgements

This research was supported by the Foundation of the State Key Laboratory of Software Development Environment (No. SKLSDE-2015ZX-04).
