Abstract

Anaphora resolution is a crucial task for information extraction. Syntax-based approaches rely on the syntactic structure of sentences. Knowledge-poor approaches aim to avoid the need for external resources or knowledge to carry out their task. This paper proposes a knowledge-poor, syntax-based approach to anaphora resolution in English texts. Our approach improves on the traditional algorithm that is considered the standard baseline for comparison in the literature. Its most relevant contributions are its ability to handle different kinds of anaphoras in different ways and to disambiguate alternate associations using gender recognition of proper nouns. The former is obtained by refining the rules of the baseline algorithm, the latter by using a machine learning approach. Experimental results on a standard benchmark dataset show that our approach can significantly improve performance over the standard baseline algorithm, and that it also compares well to the state-of-the-art algorithm, which thoroughly exploits external knowledge. It is also efficient. Thus, we propose our algorithm as the new baseline in the literature.

Highlights

  • The current wide availability and continuous increase of digital documents, especially in textual form, make it impossible to process them manually, except for a few selected and very important ones

  • Going beyond ‘simple’ information retrieval, which is typically based on some kind of lexical indexing of the texts, the information extraction field of research tries to understand a text’s content and distill it, so as to provide it to end users or make it available for further automated processing. Among other objectives, it would be extremely relevant and useful to automatically extract the facts and relationships expressed in the text and formalize them into a knowledge base that can subsequently be consulted for many different purposes: answering queries whose answer is explicitly reported in the knowledge base, carrying out formal reasoning that infers information not explicitly reported in the knowledge base, etc

  • Precision and Recall express, respectively, the ratio of correct answers among the answers given and the ratio of correct answers over the real set of correct answers, in terms of the parameters TP (True Positives, the number of items correctly retrieved), FP (False Positives, the number of items wrongly retrieved), FN (False Negatives, the number of items wrongly discarded) and TN (True Negatives, the number of items correctly discarded). These metrics require all correct answers for the dataset to be known, and they ignore the fact that, in Anaphora Resolution (AR), the queries themselves are not known in advance: the system itself is in charge of identifying the anaphoras
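
The two ratios in the last highlight correspond to the standard definitions of precision, TP / (TP + FP), and recall, TP / (TP + FN); TN appears in neither. A minimal sketch of the computation follows. The function name, the example counts, and the choice to return 0.0 when a denominator is zero are illustrative assumptions and are not taken from the paper.

```python
from __future__ import annotations


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) computed from raw counts.

    tp: items correctly retrieved
    fp: items wrongly retrieved
    fn: items wrongly discarded
    """
    answered = tp + fp  # all answers the system gave
    expected = tp + fn  # all answers that are really correct
    # Assumption: report 0.0 when a ratio is undefined (empty denominator).
    precision = tp / answered if answered else 0.0
    recall = tp / expected if expected else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=8)
print(f"precision={p:.2f} recall={r:.2f}")
```

Both ratios presuppose that the set of anaphoras to resolve is fixed in advance, which is the limitation the highlight points out for AR.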


Summary

Introduction

The current wide availability and continuous increase of digital documents, especially in textual form, make it impossible to process them manually, except for a few selected and very important ones. A cataphora (from Greek ‘carrying down’) is in some sense the ‘opposite’ of an anaphora [8]: whilst the latter references an entity located earlier in the text, the former references an entity that will be mentioned later in the discourse (typically in the same sentence). This kind of reference is more frequent in poetry, but can also be found in common language.

Basics and Related Work
Anaphora and Anaphora Resolution
Pronominal Anaphora
Noun Phrases
Other Anaphoric References
Anaphora Resolution Algorithms
Hobbs’ Naïve Algorithm
Liang and Wu’s Approach
Evaluation
Proposed Algorithm
Gender Recognition
Implementation and Experimental Results
Gender Prediction
Anaphora Resolution Effectiveness and Efficiency
Conclusions
