Morphosyntactically-Informed Coreference Resolution for Persian with Adaptive Pruning and Global Context Aggregation
Coreference resolution in Persian, a task critical to natural language understanding, presents unique challenges due to the language's pro-drop tendencies, flexible word order, and rich morphosyntactic agreement system. This study introduces the first end-to-end (E2E) neural architecture for comprehensive Persian coreference resolution, encompassing pronominal, nominal, and named-entity mentions. The system leverages ParsBERT and integrates four novel components: Joint Mention Detection and Type Classification (JMDTC), an Adaptive Antecedent Pruning Threshold (AAPT), Morphosyntactically-Informed Attention (MIA), and Cross-Segment Coreference with Global Context Aggregation (CS-GCA). By jointly optimizing mention detection and antecedent linking, the system surpasses traditional pipelined approaches and eliminates the need for handcrafted features and complex syntactic parsers. It achieves a CoNLL average F1 score of 76.16% on the Mehr corpus, a 4.03-point improvement over the previous state of the art. It also generalizes robustly, reaching a CoNLL average F1 score of 74.20% on the RCDAT corpus (evaluated on the Uppsala test set). These findings facilitate scalable coreference resolution in low-resource languages with similar morphosyntactic challenges.
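The abstract only names the AAPT component, so the following is a minimal illustrative sketch of what an adaptive antecedent pruning threshold can look like (the function name, margin parameter, and margin-from-best rule are assumptions for illustration, not the paper's actual formulation): instead of keeping a fixed top-K antecedents per mention, candidates survive whenever they score within a margin of the best candidate, so confident mentions keep few candidates while flat score profiles keep many.

```python
import numpy as np

def adaptive_prune(antecedent_scores, margin=0.5):
    """Keep every candidate antecedent whose score falls within `margin`
    of the best-scoring candidate, instead of taking a fixed top-K cut.
    Returns the indices of the surviving candidates."""
    scores = np.asarray(antecedent_scores, dtype=float)
    if scores.size == 0:
        return np.array([], dtype=int)
    threshold = scores.max() - margin   # adaptive: moves with the scores
    return np.flatnonzero(scores >= threshold)

# A confident mention keeps few candidates; a flat profile keeps many.
print(adaptive_prune([2.0, -1.0, 1.6, 0.2]))   # [0 2]
print(adaptive_prune([0.3, 0.2, 0.1, 0.25]))   # [0 1 2 3]
```

Because the threshold moves with the score distribution, pruning adapts per mention rather than per corpus, which is the intuition an adaptive threshold trades on.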
- Research Article
1
- 10.1186/s12911-022-01862-1
- Apr 30, 2022
- BMC Medical Informatics and Decision Making
Background: Bio-entity Coreference Resolution (CR) is a vital task in biomedical text mining. An important issue in CR is the differential representation of identical mentions, since overly similar representations make their coreference harder to resolve. When extracting features, existing neural network-based models may add noise to the distinction of identical mentions because they tend to produce similar or even identical feature representations.
Methods: We propose a context-aware feature attention model that distinguishes similar or identical text units effectively for better coreference resolution. The model represents identical mentions according to their different contexts by adaptively exploiting features, which enables it to reduce textual noise and capture semantic information effectively.
Results: The experimental results show that the proposed model brings significant improvements over most baselines for coreference resolution and mention detection on the BioNLP and CRAFT-CR datasets. Empirical studies further demonstrate its superior performance on the differential representation and coreferential linking of identical mentions.
Conclusions: Identical mentions pose difficulties for current methods of bio-entity coreference resolution. We therefore propose a context-aware feature attention model that better distinguishes identical mentions and achieves superior performance on both coreference resolution and mention detection, which will further improve the performance of downstream tasks.
- Research Article
1
- 10.1109/tcbb.2024.3447273
- Nov 1, 2024
- IEEE/ACM transactions on computational biology and bioinformatics
Biomedical coreference resolution identifies the coreferences in biomedical texts and normally consists of two parts: (i) mention detection, which identifies textual representations of biological entities, and (ii) finding the coreference links among them. A popular recent approach to enhancing the task is to embed a knowledge base into deep neural networks. However, the way these methods integrate knowledge has a shortcoming: the knowledge may play a larger role in mention detection than in coreference resolution, since it is typically integrated before mention detection as part of the embeddings. Moreover, these methods focus primarily on mention-dependent knowledge (KBase), i.e., knowledge entities directly related to mentions, while ignoring the correlated knowledge (K+) between the mentions of a mention pair. For mentions with significant differences in word form, this may limit their ability to extract potential correlations between those mentions. This paper therefore develops a novel model that integrates both KBase and K+ entities and achieves state-of-the-art performance on the BioNLP and CRAFT-CR datasets. Empirical studies on mention detection at different mention lengths reveal the effectiveness of the KBase entities, and the evaluation on cross-sentence and match/mismatch coreference further demonstrates the superiority of the K+ entities in extracting latent correlations between mentions.
- Research Article
- 10.37547/ajps/volume05issue07-21
- Jul 1, 2025
- American Journal of Philological Sciences
Coreference resolution plays a crucial role in natural language processing by enabling accurate understanding of a text and identifying its semantic structure. While effective coreference resolution systems have been developed for resource-rich languages such as English, German, and Chinese, research and practical systems in this area remain insufficient for the Uzbek language. Uzbek differs significantly from other languages due to its agglutinative structure, flexible word order, and rich morphology; these linguistic features necessitate unique approaches and models for coreference resolution. This article discusses UzCoref, a coreference resolution system for Uzbek, highlighting its functional capabilities, system architecture, data flow, underlying model, testing process, comparative analysis with other systems, and advantages.
- Conference Article
48
- 10.18653/v1/k15-1002
- Jan 1, 2015
In coreference resolution, a fair amount of research treats mention detection as a preprocessing step and focuses on developing algorithms for clustering coreferred mentions. However, there are significant gaps between the performance on gold mentions and the performance on the real problem, where mentions are predicted from raw text by an imperfect Mention Detection (MD) module. Motivated by the goal of reducing such gaps, we develop an ILP-based joint formulation of coreference resolution and mention-head detection that yields significant improvements on coreference from raw text, outperforming existing state-of-the-art systems on both the ACE-2004 and the CoNLL-2012 datasets. At the same time, our joint approach improves mention detection by close to 15% F1. One key insight underlying our approach is that identifying and coreferring mention heads is not only sufficient but also more robust than working with complete mentions.
- Research Article
1
- 10.3233/jifs-201050
- Jan 1, 2020
- Journal of Intelligent & Fuzzy Systems
Coreference resolution is critical for improving the performance of all text-based systems, including information extraction, document summarization, machine translation, and question answering. Most coreference resolution solutions rely on knowledge resources such as lexical, syntactic, world, and semantic knowledge. This paper presents a new knowledge-based coreference resolution model using a neural network architecture. It takes XLNet embeddings as input and does not rely on any syntactic or dependency parsers. For more efficient span representation and mention detection, we use entity-level information: mentions are extracted from the text by a mention detector that requires no hand-engineered features, and the features themselves are produced by a deep neural network. We also propose a nonlinear multi-criteria ranking model to rank the candidate antecedents; it simultaneously determines the total score of the alternatives and the weights of the features in order to speed up the ranking process. Compared to state-of-the-art models, the simulation results show significant improvements on the English CoNLL-2012 shared task (+6.4 F1). Moreover, we achieve a 96.1% F1 score on the n2c2 medical dataset.
- Research Article
- 10.1145/3700821
- Oct 22, 2024
- ACM Transactions on Asian and Low-Resource Language Information Processing
Mention detection is an important component of a Coreference Resolution (CR) system: it identifies mentions such as names, nominals, and pronominals. These mentions can be coreferential or singleton (non-coreferential). Coreferential mentions refer to the same real-world entities elsewhere in the text, whereas singleton mentions occur only once and do not participate in coreference. Filtering out singleton mentions can substantially improve the performance of a CR process: a CR system searches the preceding text for antecedents of each mention, so removing singletons from the mention list reduces both search time and search space. This paper proposes a singleton mention detection module for Hindi text based on a Fully Connected Network (FCN) and a Long Short-Term Memory (LSTM) network, so that identified singletons can be filtered out to shrink the search space for CR. The model uses a few hand-crafted features, context information, and word embeddings from word2vec and a multilingual Bidirectional Encoder Representations from Transformers (mBERT) language model. A coreference-annotated Hindi dataset comprising 3.6K sentences and 78K tokens is used for the task. The singleton mention detection model is analyzed extensively by experimenting with various context-window lengths for each mention. The model performs best with a context window of size two, compared with other window sizes (3, 4, 5, etc.) and with using all previous and all following words of each mention. With this configuration, the Precision, Recall, and F-measure of the LSTM-FCN model with mBERT (word + context + syntactic features) for identifying singleton mentions are 63%, 71%, and 67%, respectively.
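The fixed-size context windows the abstract describes can be sketched as follows (a minimal illustration; the paper's actual feature extraction also adds hand-crafted and syntactic features, and the padding token is an assumption):

```python
def context_window(tokens, mention_idx, size=2):
    """Collect up to `size` tokens on each side of a mention, padding with
    a sentinel when the window would run past the sentence boundary."""
    pad = "<PAD>"
    left = tokens[max(0, mention_idx - size):mention_idx]
    right = tokens[mention_idx + 1:mention_idx + 1 + size]
    left = [pad] * (size - len(left)) + left
    right = right + [pad] * (size - len(right))
    return left + [tokens[mention_idx]] + right

tokens = "Ram went to the market".split()
print(context_window(tokens, 0))  # ['<PAD>', '<PAD>', 'Ram', 'went', 'to']
print(context_window(tokens, 3))  # ['went', 'to', 'the', 'market', '<PAD>']
```

Each windowed token sequence would then be embedded (word2vec or mBERT) and fed to the LSTM-FCN classifier to decide singleton vs. coreferential.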
- Book Chapter
2
- 10.1007/978-3-319-03674-8_15
- Jan 1, 2014
Natural Language Processing (NLP) includes tasks such as Information Extraction (IE), text summarization, and question answering, all of which require identifying all the information about an entity that exists in the discourse. A system for Coreference Resolution (CR) therefore contributes to the successful completion of these tasks. In this paper we study the process of coreference resolution and present a system capable of identifying coreferent mentions in Farsi corpora for the first time. Our study rests on three main components: a Farsi corpus with coreference annotation, a mention recognition system and its domain, and an algorithm for predicting coreferent mentions. As a first step, we prepare a corpus with suitable labels; as the first Farsi corpus with mention and coreference labels, it can serve as the basis for much research on Mention Detection (MD) and CR. Using this corpus and studying rules and priorities among mentions, we present a system that identifies mentions and generates positive and negative examples. We then apply learning algorithms such as SVM, neural networks, and decision trees to the extracted samples and evaluate models for predicting coreferent mentions in Farsi. We conclude that the neural network outperforms the other learners. Keywords: Co-reference Resolution, Mention Detection, SVM, Neural Network, Decision Tree, Farsi Corpus
- Conference Article
58
- 10.18653/v1/p18-2017
- Jan 1, 2018
Coreference resolution aims to identify in a text all mentions that refer to the same real world entity. The state-of-the-art end-to-end neural coreference model considers all text spans in a document as potential mentions and learns to link an antecedent for each possible mention. In this paper, we propose to improve the end-to-end coreference resolution system by (1) using a biaffine attention model to get antecedent scores for each possible mention, and (2) jointly optimizing the mention detection accuracy and mention clustering accuracy given the mention cluster labels. Our model achieves the state-of-the-art performance on the CoNLL-2012 shared task English test set.
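The biaffine antecedent-scoring idea can be sketched with toy numbers; in the actual model the weights below are learned end-to-end together with the span encoder, so the random matrices here stand in purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                       # toy span-embedding size and mention count
spans = rng.normal(size=(n, d))   # one vector per candidate span
W = rng.normal(size=(d, d))       # biaffine weight matrix (learned in practice)
u = rng.normal(size=(d,))         # linear term over the antecedent (learned)

# Biaffine score for mention i with candidate antecedent j:
#   s(i, j) = x_i^T W x_j + u^T x_j
scores = spans @ W @ spans.T + spans @ u   # (n, n); column j carries u^T x_j

# An antecedent must precede its mention, so mask out every j >= i.
scores[np.triu(np.ones((n, n), dtype=bool))] = -np.inf
best = scores[1:].argmax(axis=1)  # best antecedent for mentions 1..n-1
```

The joint-optimization part of the paper then adds a mention-detection loss on the same spans, trained alongside this clustering objective.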
- Research Article
- 10.1016/j.csl.2024.101681
- Jun 18, 2024
- Computer Speech & Language
Enhancing Turkish Coreference Resolution: Insights from deep learning, dropped pronouns, and multilingual transfer learning
- Research Article
1
- 10.1017/s1351324924000019
- Jan 25, 2024
- Natural Language Engineering
Coreference resolution is the task of identifying and clustering mentions that refer to the same entity in a document. Based on state-of-the-art deep learning approaches, end-to-end coreference resolution considers all spans as candidate mentions and tackles mention detection and coreference resolution simultaneously. Recently, researchers have attempted to incorporate document-level context using higher-order inference (HOI) to improve end-to-end coreference resolution. However, HOI methods have been shown to have marginal or even negative impact on coreference resolution. In this paper, we reveal the reasons for the negative impact of HOI coreference resolution. Contextualized representations (e.g., those produced by BERT) for building span embeddings have been shown to be highly anisotropic. We show that HOI actually increases and thus worsens the anisotropy of span embeddings and makes it difficult to distinguish between related but distinct entities (e.g., pilots and flight attendants). Instead of using HOI, we propose two methods, Less-Anisotropic Internal Representations (LAIR) and Data Augmentation with Document Synthesis and Mention Swap (DSMS), to learn less-anisotropic span embeddings for coreference resolution. LAIR uses a linear aggregation of the first layer and the topmost layer of contextualized embeddings. DSMS generates more diversified examples of related but distinct entities by synthesizing documents and by mention swapping. Our experiments show that less-anisotropic span embeddings improve the performance significantly (+2.8 F1 gain on the OntoNotes benchmark) reaching new state-of-the-art performance on the GAP dataset.
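LAIR's linear aggregation is simple enough to sketch directly; the mixing weight `alpha` below is illustrative, as the paper's exact weighting may differ:

```python
import numpy as np

def lair_embedding(layer_outputs, alpha=0.5):
    """Less-Anisotropic Internal Representations: mix the first and the
    topmost transformer layers linearly, instead of using only the top
    (highly anisotropic) layer.  `layer_outputs` is a list of
    (seq_len, hidden) arrays, one per layer."""
    first, top = layer_outputs[0], layer_outputs[-1]
    return alpha * top + (1.0 - alpha) * first

# Toy stand-in for 13 layer outputs (embedding layer + 12 transformer layers).
layers = [np.full((3, 4), float(i)) for i in range(13)]
mixed = lair_embedding(layers, alpha=0.5)
print(mixed[0, 0])  # 6.0 == 0.5 * 12 + 0.5 * 0
```

Because early layers are much less anisotropic than the top layer, the mixed span embeddings stay better spread out, which is what makes related but distinct entities easier to separate.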
- Dissertation
1
- 10.11588/heidok.00023305
- Jan 1, 2017
Coreference resolution is the task of determining which expressions in a text are used to refer to the same entity. This task is one of the most fundamental problems of natural language understanding. Inherently, coreference resolution is a structured task, as the output consists of sets of coreferring expressions. This complex structure poses several challenges since it is not clear how to account for the structure in terms of error analysis and representation. In this thesis, we present a treatment of computational coreference resolution that accounts for the structure. Our treatment encompasses error analysis and the representation of approaches to coreference resolution. In particular, we propose two frameworks in this thesis. The first framework deals with error analysis. We gather requirements for an appropriate error analysis method and devise a framework that considers a structured graph-based representation of the reference annotation and the system output. Error extraction is performed by constructing linguistically motivated or data-driven spanning trees for the graph-based coreference representations. The second framework concerns the representation of approaches to coreference resolution. We show that approaches to coreference resolution can be understood as predictors of latent structures that are not annotated in the data. From these latent structures, the final output is derived during a post-processing step. We devise a machine learning framework for coreference resolution based on this insight. In this framework, we have a unified representation of approaches to coreference resolution. Individual approaches can be expressed as instantiations of a generic approach. We express many approaches from the literature as well as novel variants in our framework, ranging from simple pairwise classification approaches to complex entity-centric models. 
Using the uniform representation, we are able to analyze differences and similarities between the models transparently and in detail. Finally, we employ the error analysis framework to perform a qualitative analysis of differences in error profiles of the models on a benchmark dataset. We trace back differences in the error profiles to differences in the representation. Our analysis shows that a mention ranking model and a tree-based mention-entity model with left-to-right inference have the highest performance. We discuss reasons for the improved performance and analyze why more advanced approaches modeled in our framework cannot improve on these models. An implementation of the frameworks discussed in this thesis is publicly available.
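The spanning-tree device behind the error-analysis framework can be illustrated with one simple, linguistically motivated tree (a left-linking construction chosen for illustration, not necessarily the thesis's exact variant): each mention of an entity attaches to its closest preceding mention, and recall errors then fall out as gold-tree edges absent from the system tree.

```python
def closest_preceding_tree(mention_positions):
    """Spanning tree over one entity's mentions: attach each mention to
    its closest preceding mention.  Returns (mention, antecedent) pairs."""
    order = sorted(mention_positions)
    return [(order[i], order[i - 1]) for i in range(1, len(order))]

def recall_errors(gold_positions, system_positions):
    """Gold-tree edges absent from the system tree count as recall errors."""
    gold = set(closest_preceding_tree(gold_positions))
    system = set(closest_preceding_tree(system_positions))
    return sorted(gold - system)

print(closest_preceding_tree([7, 2, 5]))   # [(5, 2), (7, 5)]
print(recall_errors([2, 5, 7], [2, 5]))    # [(7, 5)]
```

Swapping in a different spanning-tree construction changes which edges count as errors, which is exactly the representational choice the framework makes explicit.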
- Conference Article
3
- 10.18653/v1/2021.crac-1.1
- Jan 1, 2021
Pronoun Coreference Resolution (PCR) is the task of resolving pronominal expressions to all mentions they refer to. Compared with the general coreference resolution task, the main challenge of PCR is the coreference relation prediction rather than the mention detection. As one important natural language understanding (NLU) component, pronoun resolution is crucial for many downstream tasks and still challenging for existing models, which motivates us to survey existing approaches and think about how to do better. In this survey, we first introduce representative datasets and models for the ordinary pronoun coreference resolution task. Then we focus on recent progress on hard pronoun coreference resolution problems (e.g., Winograd Schema Challenge) to analyze how well current models can understand commonsense. We conduct extensive experiments to show that even though current models are achieving good performance on the standard evaluation set, they are still not ready to be used in real applications (e.g., all SOTA models struggle on correctly resolving pronouns to infrequent objects). All experiment codes will be available upon acceptance.
- Research Article
14
- 10.1186/1471-2105-16-s10-s6
- Jun 23, 2015
- BMC Bioinformatics
Background: The acquisition of knowledge about relations between bacteria and their locations (habitats and geographical locations) in short texts about bacteria, as defined in the BioNLP-ST 2013 Bacteria Biotope task, depends on the detection of co-reference links between mentions of entities of each of these three types. To our knowledge, no participant in this task has investigated this aspect of the situation. The present work specifically addresses the issues it raises: (i) how to detect these co-reference links and the associated co-reference chains; (ii) how to use them to prepare positive and negative examples to train a supervised system for the detection of relations between entity mentions; (iii) what context around which entity mentions contributes to relation detection when co-reference chains are provided.
Results: We present experiments and results obtained both with gold entity mentions (task 2 of BioNLP-ST 2013) and with automatically detected entity mentions (end-to-end system, in task 3 of BioNLP-ST 2013). Our supervised mention detection system uses a linear-chain Conditional Random Fields classifier, and our relation detection system relies on a Logistic Regression (aka Maximum Entropy) classifier. They use a set of morphological, morphosyntactic, and semantic features. To minimize false inferences, co-reference resolution applies a set of heuristic rules designed to optimize precision. These rules take into account the types of the detected entity mentions and take advantage of the didactic nature of the texts in the corpus, where a large proportion of bacteria naming is fairly explicit (although natural referring expressions such as "the bacteria" are common). The resulting system achieved a 0.495 F-measure on the official test set when taking gold entity mentions as input, and a 0.351 F-measure when taking entity mentions predicted by our CRF system as input, both of which are above the best BioNLP-ST 2013 participant system.
Conclusions: We show that co-reference resolution substantially improves over a baseline system that does not use co-reference information: about 3.5 F-measure points on the test corpus for the end-to-end system (5.5 points on the development corpus), and 7 F-measure points on both development and test corpora when gold mentions are used. While this outperforms the best published system on the BioNLP-ST 2013 Bacteria Biotope dataset, we consider that it mostly provides a stronger baseline from which more work can start. We also emphasize the importance and difficulty of designing a comprehensive gold-standard co-reference annotation, which we explain is a key point for further progress on the task.
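A precision-oriented heuristic of the kind described can be sketched as follows (an assumed shape for illustration; the paper's actual rule set is richer): link a generic referring expression such as "the bacteria" to the nearest preceding mention of the matching entity type, and abstain when no such antecedent exists.

```python
def link_anaphor(mentions, anaphor_idx, wanted_type="Bacteria"):
    """Link the anaphor at `anaphor_idx` to the nearest preceding mention
    of type `wanted_type`.  Return None (abstain) when there is none,
    which favours precision over recall."""
    for j in range(anaphor_idx - 1, -1, -1):
        _text, mtype = mentions[j]
        if mtype == wanted_type:
            return j
    return None

mentions = [("Bifidobacterium longum", "Bacteria"),
            ("the infant gut", "Habitat"),
            ("the bacteria", "Bacteria")]
print(link_anaphor(mentions, 2))  # 0 -> "Bifidobacterium longum"
```

Abstaining rather than guessing is what keeps such rules from introducing false inferences into the relation-extraction training examples.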
- Research Article
1
- 10.3390/app13169272
- Aug 15, 2023
- Applied Sciences
There are several possibilities to improve classification in natural language processing tasks. In this article, we focused on the issue of coreference resolution that was applied to a manually annotated dataset of true and fake news. This dataset was used for the classification task of fake news detection. The research aimed to determine whether performing coreference resolution on the input data before classification or classifying them without performing coreference resolution is more effective. We also wanted to verify whether it is possible to enhance classifier performance metrics by incorporating coreference resolution into the data preparation process. A methodology was proposed, in which we described the implementation methods in detail, starting from the identification of entity mentions in the text using the neuralcoref algorithm, then through word-embedding models (TF–IDF, Doc2Vec), and finally to several machine learning methods. The result was a comparison of the implemented classifiers based on the performance metrics described in the theoretical part. The best result for accuracy was observed for the dataset with coreference resolution applied, which had a median value of 0.8149, while for the F1 score, the best result had a median value of 0.8101. However, the more important finding is that the processed data with the application of coreference resolution led to an improvement in performance metrics in the classification tasks.
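The preprocessing step the article evaluates, resolving coreferences before vectorization, can be sketched at the token level (a deliberate simplification: neuralcoref resolves multi-token spans, and the cluster format below is assumed for illustration):

```python
def resolve_coreferences(tokens, clusters):
    """Replace every non-head mention with its cluster head so that
    bag-of-words features (e.g. TF-IDF) count the entity consistently.
    `clusters` holds lists of token indices; the first index in each
    cluster is treated as the canonical mention."""
    out = list(tokens)
    for cluster in clusters:
        head = tokens[cluster[0]]
        for idx in cluster[1:]:
            out[idx] = head
    return out

tokens = "Alice said she won".split()
print(resolve_coreferences(tokens, [[0, 2]]))
# ['Alice', 'said', 'Alice', 'won']
```

After this substitution, "Alice" is counted twice instead of being split between "Alice" and "she", which is the mechanism by which coreference resolution can sharpen TF-IDF or Doc2Vec features for the downstream classifier.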
- Conference Article
41
- 10.3115/1621787.1621800
- Jan 1, 2005
Arabic presents an interesting challenge to natural language processing, being a highly inflected and agglutinative language. This paper presents an in-depth investigation of the entity detection and recognition (EDR) task for Arabic. We start by highlighting why segmentation is a necessary prerequisite for EDR, continue by presenting a finite-state statistical segmenter, and then examine how the resulting segments can be better incorporated into a mention detection system and an entity recognition system; both systems are statistical and built around the maximum entropy principle. Experiments on a clearly stated partition of the ACE 2004 data show that stem-based features can significantly improve the performance of the EDR system by 2 absolute F-measure points. The system presented here had a competitive performance in the ACE 2004 evaluation.