Coreference Annotation Research Articles

Background: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. To better understand the nature of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.

Results: The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available, out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 named entities in the full 97-paper corpus, an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus.

Conclusions: The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain, because they refer to specific classes in domain-specific ontologies. The comparison of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain. The results also suggest that the baseline performance difference is quite large.
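The agreement and system scores quoted above are reported with cluster-based metrics such as B3. As a rough illustration of how such a score is computed, here is a minimal sketch of plain B3 in Python. It is not the scorer used for CRAFT and not the Class-B3 variant the abstract cites; the function name `b_cubed` and the choice to score only mentions present in both partitions are assumptions made for this example.

```python
from typing import Dict, List, Set


def b_cubed(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Plain B3 precision, recall and F1 for two partitions of mention ids.

    Assumption for this sketch: only mentions that occur in both the gold
    and the predicted partition are scored.
    """
    # Map every mention id to the cluster that contains it.
    gold_of = {m: cluster for cluster in gold for m in cluster}
    pred_of = {m: cluster for cluster in pred for m in cluster}
    mentions = gold_of.keys() & pred_of.keys()
    if not mentions:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Per-mention precision: share of the predicted cluster that is correct.
    precision = sum(
        len(gold_of[m] & pred_of[m]) / len(pred_of[m]) for m in mentions
    ) / len(mentions)
    # Per-mention recall: share of the gold cluster that was recovered.
    recall = sum(
        len(gold_of[m] & pred_of[m]) / len(gold_of[m]) for m in mentions
    ) / len(mentions)

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: one gold chain of three mentions is split by the system.
gold_chains = [{"a", "b", "c"}, {"d"}]
system_chains = [{"a", "b"}, {"c", "d"}]
scores = b_cubed(gold_chains, system_chains)
# precision = 0.75, recall = 2/3, F1 = 12/17 (about 0.706)
```

Because B3 averages over mentions while entity-based CEAF aligns whole entities, the same pair of annotations can score quite differently under the two metrics, which is consistent with the wide agreement range reported.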

Background: The acquisition of knowledge about relations between bacteria and their locations (habitats and geographical locations) in short texts about bacteria, as defined in the BioNLP-ST 2013 Bacteria Biotope task, depends on the detection of co-reference links between mentions of entities of each of these three types. To our knowledge, no participant in this task has investigated this aspect. The present work addresses three issues it raises: (i) how to detect these co-reference links and the associated co-reference chains; (ii) how to use them to prepare positive and negative examples to train a supervised system for the detection of relations between entity mentions; (iii) what context around which entity mentions contributes to relation detection when co-reference chains are provided.

Results: We present experiments and results obtained both with gold entity mentions (task 2 of BioNLP-ST 2013) and with automatically detected entity mentions (end-to-end system, task 3 of BioNLP-ST 2013). Our supervised mention detection system uses a linear-chain Conditional Random Fields (CRF) classifier, and our relation detection system relies on a Logistic Regression (also known as Maximum Entropy) classifier. Both use a set of morphological, morphosyntactic and semantic features. To minimize false inferences, co-reference resolution applies a set of heuristic rules designed to optimize precision. The rules take into account the types of the detected entity mentions and take advantage of the didactic nature of the texts in the corpus, where a large proportion of bacteria naming is fairly explicit (although natural referring expressions such as "the bacteria" are common). The resulting system achieved a 0.495 F-measure on the official test set when given the gold entity mentions as input, and a 0.351 F-measure when given entity mentions predicted by our CRF system, both of which are above the best BioNLP-ST 2013 participant system.

Conclusions: We show that co-reference resolution substantially improves over a baseline system that does not use co-reference information: about 3.5 F-measure points on the test corpus for the end-to-end system (5.5 points on the development corpus) and 7 F-measure points on both development and test corpora when gold mentions are used. While this outperforms the best published system on the BioNLP-ST 2013 Bacteria Biotope dataset, we consider that it mostly provides a stronger baseline from which further work can start. We also emphasize the importance and difficulty of designing a comprehensive gold-standard co-reference annotation, which we argue is key to further progress on the task.
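To make the idea of precision-oriented, type-aware heuristic rules more concrete, the following Python sketch shows one hypothetical rule of that kind. It is only an illustration and does not reproduce the authors' rule set: the `Mention` class, the `GENERIC_BACTERIA` phrase list, and the rule itself (link a generic expression such as "the bacteria" to the most recent explicitly named mention of type Bacteria) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Mention:
    start: int   # character offset of the mention in the text
    text: str    # surface string
    etype: str   # entity type, e.g. "Bacteria", "Habitat", "Geographical"


# Hypothetical list of generic referring expressions for bacteria.
GENERIC_BACTERIA = {"the bacteria", "the bacterium", "the organism"}


def link_generic_mentions(mentions: List[Mention]) -> List[Tuple[int, int]]:
    """Return (anaphor_index, antecedent_index) pairs into `mentions`.

    Precision-first rule: a generic Bacteria expression is linked only to
    the closest preceding explicitly named Bacteria mention; if there is
    none, no link is produced.
    """
    links: List[Tuple[int, int]] = []
    last_named: Optional[int] = None
    # Walk the mentions in text order, remembering original indices.
    for i in sorted(range(len(mentions)), key=lambda k: mentions[k].start):
        mention = mentions[i]
        if mention.etype != "Bacteria":
            continue  # type constraint: never link across entity types
        if mention.text.lower() in GENERIC_BACTERIA:
            if last_named is not None:
                links.append((i, last_named))
        else:
            last_named = i
    return links


doc_mentions = [
    Mention(0, "Bacillus subtilis", "Bacteria"),
    Mention(30, "soil", "Habitat"),
    Mention(50, "the bacteria", "Bacteria"),
]
links = link_generic_mentions(doc_mentions)  # [(2, 0)]
```

Refusing to link when no named antecedent is available trades recall for precision, which matches the stated goal of minimizing false inferences.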

Articles published on Coreference Annotation

Code Book for the Annotation of Diverse Cross-Document Coreference of Entities in News Articles

Improving completeness and consistency of co-reference annotation standard

CDCAT: A multi-language cross-document entity and event coreference annotation tool

Decomposing and Recomposing Event Structure

Analysis of the Full-Size Russian Corpus of Internet Drug Reviews with Complex NER Labeling Using Deep Learning Neural Networks and Language Models

Corref-PT: A Semi-Automatic Annotated Portuguese Coreference Corpus

Adjudication of coreference annotations via answer set optimisation

Studying text coherence in Czech – a corpus-based analysis

Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles

Semantic Annotation of Anaphoric Links in Language

Coreference Annotation in the Russian Clinical Pear Stories Corpus: Annotation Features and Preliminary Results

Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text.

The GUM corpus: creating multilayer resources in the classroom

The contribution of co-reference resolution to supervised relation detection between bacteria and biotopes entities.

<tiger2/>: serialising the ISO SynAF syntactic object model

Chinese Overt Pronoun Resolution: A Bilingual Approach

Identity, non-identity, and near-identity: Addressing the complexity of coreference

Gesture Salience as a Hidden Variable for Coreference Resolution and Keyframe Extraction

Text Readability and Coreference Annotation across Heterogeneous Media for the Digital Archive of Rare Books

Open Ontology Forge: A Tool for Ontology Creation and Text Annotation Applied to the Biomedical Domain
