Journal Home > Volume 27 , Issue 3

CDCAT: A Multi-Language Cross-Document Entity and Event Coreference Annotation Tool

Yang Xu, Boming Xia, Yueliang Wan, Fan Zhang, Jiabo Xu, Huansheng Ning
School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Beijing Engineering Research Center for Cyberspace Data Analysis and Applications, Beijing 100083, China
Research Institute with Run Technologies Company, Ltd., Beijing 100192, China
School of Information Engineering, Xinjiang Institute of Engineering, Urumqi 830091, China

Abstract

Annotating coreference corpora requires a manual annotation tool that helps annotators label cross-document entity and event coreference relations between mentions in text. To the best of our knowledge, CROss-document Main Events and entities Recognition (CROMER) is the only open-source manual annotation tool available for cross-document entity and event coreference. However, CROMER lacks multi-language support and extensibility. Moreover, to label cross-document coreference relations between mentions, CROMER depends on a separate intra-document coreference annotation tool, Content Annotation Tool, which is no longer available. To address these problems, we introduce the Cross-Document Coreference Annotation Tool (CDCAT), a new multi-language open-source manual annotation tool for cross-document entity and event coreference that can handle different input/output formats, preprocessing functions, languages, and annotation systems. With CDCAT, annotators can label a coreference relation with only two mouse clicks. Best-practice analyses show that annotators can reach an annotation speed of 0.025 coreference relations per second on a corpus with a coreference density of 0.076 coreference relations per word. As the first multi-language open-source annotation tool for cross-document entity and event coreference, CDCAT can theoretically achieve higher annotation efficiency than CROMER.
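
To put the two reported figures in more familiar units, the following is a minimal arithmetic sketch and not part of CDCAT. It assumes the annotation speed and the coreference density refer to the same corpus, so that dividing one by the other gives an implied text throughput in words per second. The function names and the 10,000-word corpus size are hypothetical and chosen only for illustration.

```python
# Figures quoted in the abstract.
SPEED_RELATIONS_PER_SECOND = 0.025   # annotation speed
DENSITY_RELATIONS_PER_WORD = 0.076   # coreference density of the corpus


def relations_per_hour(speed_per_second: float) -> float:
    """Convert an annotation speed from relations/second to relations/hour."""
    return speed_per_second * 3600


def words_per_second(speed_per_second: float, density_per_word: float) -> float:
    """Implied text throughput, assuming speed and density share one corpus."""
    return speed_per_second / density_per_word


def hours_to_annotate(word_count: int, speed_per_second: float,
                      density_per_word: float) -> float:
    """Estimated hours to label every relation in a corpus of word_count words."""
    expected_relations = word_count * density_per_word
    return expected_relations / speed_per_second / 3600


# Hypothetical corpus size, used only to illustrate the calculation.
EXAMPLE_WORD_COUNT = 10_000

print(f"{relations_per_hour(SPEED_RELATIONS_PER_SECOND):.0f} relations per hour")
print(f"{words_per_second(SPEED_RELATIONS_PER_SECOND, DENSITY_RELATIONS_PER_WORD) * 60:.1f} words per minute")
print(f"{hours_to_annotate(EXAMPLE_WORD_COUNT, SPEED_RELATIONS_PER_SECOND, DENSITY_RELATIONS_PER_WORD):.1f} hours for {EXAMPLE_WORD_COUNT} words")
```

Under these assumptions, the reported speed corresponds to about 90 relations per hour, or roughly 20 words of text per minute at the stated density.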

Keywords: natural language processing, event coreference, entity coreference, manual annotation tool

Publication history

Received: 17 November 2020
Revised: 02 December 2020
Accepted: 17 December 2020
Published: 13 November 2021
Issue date: June 2022

Copyright

© The author(s) 2022

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61872038) and the Fundamental Research Funds for the Central Universities (No. FRF-GF-19-020B).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).