Volume 22, Issue 6




Measuring Similarity of Academic Articles with Semantic Profile and Joint Word Embedding

Ming Liu, Bo Lang, Zepeng Gu, and Ahmed Zeeshan
State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China.

Abstract

Measuring the semantic similarity of long documents is significant for many applications, such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts, and document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transitions. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements, such as the research purpose, methodology, and domain, are included and enriched. We can thus obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured with word vectors. To improve the semantic representation quality of the word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that our approach substantially improves on state-of-the-art methods for measuring document semantic similarity, and that our joint word-embedding model produces significantly better word representations than traditional word-embedding models.

Keywords: word embedding, document semantic similarity, text understanding, semantic enrichment, scientific literature analysis
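The profile-matching idea described in the abstract can be sketched as follows: represent each paper by a small set of profile concepts, embed each concept as a word vector, and aggregate concept-level cosine similarities into a document-level score. The vectors, vocabulary, and aggregation rule below are illustrative assumptions for a minimal sketch, not the paper's actual model:

```python
from math import sqrt

# Toy 3-d word vectors standing in for embeddings trained with the joint
# word-embedding model; all words and values here are illustrative assumptions.
VECTORS = {
    "retrieval": (0.9, 0.1, 0.0),
    "search":    (0.8, 0.2, 0.1),
    "ranking":   (0.7, 0.3, 0.2),
    "parsing":   (0.1, 0.9, 0.3),
    "grammar":   (0.0, 0.8, 0.4),
}

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def profile_similarity(profile_a, profile_b):
    """Score two semantic profiles by matching each concept in profile_a
    to its closest concept in profile_b and averaging the matches (one
    simple aggregation; the paper's exact profile distance may differ)."""
    best = [max(cosine(VECTORS[a], VECTORS[b]) for b in profile_b)
            for a in profile_a]
    return sum(best) / len(best)

# Hypothetical profiles: concepts extracted for purpose/method/domain.
paper_ir_1 = ["retrieval", "ranking"]
paper_ir_2 = ["search", "retrieval"]
paper_nlp  = ["parsing", "grammar"]

print(profile_similarity(paper_ir_1, paper_ir_2))  # high (same domain)
print(profile_similarity(paper_ir_1, paper_nlp))   # lower (different domain)
```

With better embeddings, concepts that share a domain-specific relation (e.g., a method and the task it addresses) land closer together, which is exactly what the joint model's semantic relation constraint is meant to encourage.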

References (38)

[1]
Tenenbaum J. B., Kemp C., Griffiths T. L., and Goodman N. D., How to grow a mind: Statistics, structure, and abstraction, Science, vol. 331, no. 6022, pp. 1279-1285, 2011.
[2]
Pan J. Y., Cheng C. P. J., Lau G. T., and Law K. H., Utilizing statistical semantic similarity techniques for ontology mapping—with applications to AEC standard models, Tsinghua Sci. Technol., vol. 13, no. S1, pp. 217-222, 2008.
[3]
Leacock C. and Chodorow M., Combining local context and WordNet similarity for word sense identification, in WordNet: An Electronic Lexical Database, Fellbaum C., ed. The MIT Press, 1998.
[4]
Mikolov T., Chen K., Corrado G., and Dean J., Efficient estimation of word representations in vector space, arXiv preprint arXiv: 1301.3781, 2013.
[5]
Resnik P., Using information content to evaluate semantic similarity in a taxonomy, in Proc. 14th Int. Joint Conf. Artificial Intelligence, Montreal, Canada, 1995.
[6]
Rus V., Lintean M. C., Graesser A., and McNamara D., Assessing student paraphrases using lexical semantics and word weighting, in Proc. 14th Int. Conf. Artificial Intelligence in Education, Brighton, UK, 2009.
[7]
Corley C. and Mihalcea R., Measuring the semantic similarity of texts, in Proc. ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI, USA, 2005, pp. 13-18.
[8]
Xu Z., Luo X. F., Zhang S. X., Wei X., Mei L., and Hu C. P., Mining temporal explicit and implicit semantic relations between entities using web search engines, Future Generat. Comput. Syst., vol. 37, pp. 468-477, 2014.
[9]
Xu Z., Luo X. F., Yu J., and Xu W. M., Measuring semantic similarity between words by removing noise and redundancy in web snippets, Concurr. Comput. Pract. Exp., vol. 23, no. 18, pp. 2496-2510, 2011.
[10]
Xu Z., Luo X. F., Mei L., and Hu C. P., Measuring the semantic discrimination capability of association relations, Concurr. Comput. Pract. Exp., vol. 26, no. 2, pp. 380-395, 2014.
[11]
Agirre E., Banea C., Cardie C., Cer D., Diab M., Gonzalez-Agirre A., Guo W. W., Lopez-Gazpio I., Maritxalar M., Mihalcea R., et al., SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability, in Proc. 9th Int. Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 2015.
[12]
Šarić F., Glavaš G., Karan M., Šnajder J., and Bašić B. D., TakeLab: Systems for measuring semantic text similarity, in Proc. 6th Int. Workshop on Semantic Evaluation, Montréal, Canada, 2012, pp. 441-448.
[13]
Bär D., Biemann C., Gurevych I., and Zesch T., UKP: Computing semantic textual similarity by combining multiple content similarity measures, in Proc. 1st Joint Conf. Lexical and Computational Semantics, Montréal, Canada, 2012.
[14]
Han W., Zhu X., Zhu Z., Chen W., Zheng W., and Lu J., A comparative analysis on weibo and twitter, Tsinghua Sci. Technol., vol. 21, no. 1, pp. 1-16, 2016.
[15]
Zhang M. Y., Qin B., Liu T., and Zheng M., Triple based background knowledge ranking for document enrichment, in Proc. COLING 2014, the 25th Int. Conf. Computational Linguistics: Technical Papers, Dublin, Ireland, 2014.
[16]
Schuhmacher M. and Ponzetto S. P., Knowledge-based graph document modeling, in Proc. 7th ACM Int. Conf. Web Search and Data Mining, New York, NY, USA, 2014, pp. 543-552.
[17]
Ramage D., Rafferty A. N., and Manning C. D., Random walks for text semantic similarity, in Proc. 2009 Workshop on Graph-Based Methods for Natural Language Processing, Suntec, Singapore, 2009, pp. 23-31.
[18]
Zhang M. Y., Qin B., Zheng M., Hirst G., and Liu T., Encoding distributional semantics into triple-based knowledge ranking for document enrichment, in Proc. 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int. Joint Conf. Natural Language Processing, Beijing, China, 2015.
[19]
Salton G., Wong A., and Yang C. S., A vector space model for automatic indexing, Commun. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[20]
Miller G. A., WordNet: A lexical database for English, Commun. ACM, vol. 38, no. 11, pp. 39-41, 1995.
[21]
Bollacker K., Evans C., Paritosh P., Sturge T., and Taylor J., Freebase: A collaboratively created graph database for structuring human knowledge, in Proc. 2008 ACM SIGMOD Int. Conf. Management of Data, Vancouver, Canada, 2008, pp. 1247-1250.
[22]
Landauer T. K., Foltz P. W., and Laham D., An introduction to latent semantic analysis, Discourse Process., vol. 25, nos. 2&3, pp. 259-284, 1998.
[23]
Wang D. Q., Zhang H., Liu R., Liu X. L., and Wang J., Unsupervised feature selection through Gram-Schmidt orthogonalization—A word co-occurrence perspective, Neurocomputing, vol. 173, pp. 845-854, 2016.
[24]
Blei D. M., Ng A. Y., and Jordan M. I., Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[25]
Bengio Y., Schwenk H., Senécal J. S., Morin F., and Gauvain J. L., Neural probabilistic language models, in Innovations in Machine Learning, Holmes D. E. and Jain L. C., eds. Springer, 2006, pp. 137-186.
[26]
Le Q. V. and Mikolov T., Distributed representations of sentences and documents, arXiv preprint arXiv: 1405.4053, 2014.
[27]
Auer S., Bizer C., Kobilarov G., Lehmann J., Cyganiak R., and Ives Z., DBpedia: A nucleus for a web of open data, in The Semantic Web, Aberer K., Choi K. S., Noy N., Allemang D., Lee K. I., Nixon L., Golbeck J., Mika P., Maynard D., Mizoguchi R., et al., eds. Springer, 2007.
[28]
Pennington J., Socher R., and Manning C. D., GloVe: Global vectors for word representation, in Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
[29]
Gabrilovich E. and Markovitch S., Computing semantic relatedness using Wikipedia-based explicit semantic analysis, in Proc. 20th Int. Joint Conf. Artificial Intelligence, Hyderabad, India, 2007, pp. 1606-1611.
[30]
Rafi M. and Shaikh M. S., An improved semantic similarity measure for document clustering based on topic maps, arXiv preprint arXiv: 1303.4087, 2013.
[31]
Rus V., Niraula N., and Banjade R., Similarity measures based on latent Dirichlet allocation, in Computational Linguistics and Intelligent Text Processing, Gelbukh A., ed. Springer, 2013, pp. 459-470.
[32]
Rus V., Lintean M., Banjade R., Niraula N., and Stefanescu D., SEMILAR: The semantic similarity toolkit, in Proc. 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 163-168.
[33]
Wu Z. B. and Palmer M., Verb semantics and lexical selection, in Proc. 32nd Annual Meeting on Association for Computational Linguistics, Las Cruces, NM, USA, 1994, pp. 133-138.
[34]
Fried D. and Duh K., Incorporating both distributional and relational semantics in word representations, arXiv preprint arXiv: 1412.4369, 2014.
[35]
Yu M. and Dredze M., Improving lexical embeddings with semantic knowledge, in Proc. 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Baltimore, MD, USA, 2014, pp. 545-550.
[36]
Radev D. R., Muthukrishnan P., and Qazvinian V., The ACL anthology network corpus, in Proc. 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, Suntec, Singapore, 2009, pp. 54-61.
[37]
Dolan B., Quirk C., and Brockett C., Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources, in Proc. 20th Int. Conf. Computational Linguistics, Geneva, Switzerland, 2004, p. 350.
[38]
Rus V., Lintean M., Moldovan C., Baggett W., Niraula N., and Morgan B., The SIMILAR corpus: A resource to foster the qualitative understanding of semantic similarity of texts, in Proc. 8th Language Resources and Evaluation Conf., Istanbul, Turkey, 2012, pp. 23-25.

Publication history

Received: 31 December 2016
Revised: 22 April 2017
Accepted: 14 June 2017
Published: 14 December 2017
Issue date: December 2017

Copyright

© The author(s) 2017

Acknowledgements

This research was supported by the Foundation of the State Key Laboratory of Software Development Environment (No. SKLSDE-2015ZX-04).
