Abstract

Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically rich languages (MRLs) pose a challenge to this basic formulation, as the boundaries of named entities do not necessarily coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs, we therefore need to answer two fundamental questions: what are the basic units to be labeled, and how can these units be detected and classified in realistic settings (i.e., where no gold morphology is available)? We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich and ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both the Hebrew NER and Hebrew morphological decomposition tasks.
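To make the architectural contrast concrete, the schematic sketch below outlines the two pipeline orders mentioned in the abstract: the standard pipeline, where a single morphological decomposition is committed to before NER, and the hybrid order, where token-level NER runs first and its predicted entity spans prune the space of admissible morphological analyses. All component names and interfaces here are hypothetical placeholders, not the authors' implementation.

    # Schematic contrast of the two pipeline orders discussed in the abstract.
    # All components below are hypothetical stand-ins, not the paper's system.

    def standard_pipeline(sentence, morph_analyzer, morph_ner):
        """Morphological decomposition strictly precedes NER: any segmentation
        error is committed to before entity labels are assigned."""
        morphemes = morph_analyzer.disambiguate(sentence)   # commit to one analysis
        return morph_ner.label(morphemes)                   # label the chosen morphemes

    def hybrid_pipeline(sentence, token_ner, morph_analyzer, morph_ner):
        """NER precedes and prunes morphological decomposition: token-level
        entity spans constrain which morphological analyses remain admissible."""
        token_spans = token_ner.label(sentence)              # coarse, token-level NER
        lattice = morph_analyzer.lattice(sentence)           # all candidate analyses
        pruned = lattice.keep_consistent_with(token_spans)   # drop conflicting paths
        morphemes = morph_analyzer.disambiguate(pruned)      # decode the pruned lattice
        return morph_ner.label(morphemes)                    # morpheme-level NER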

Highlights

  • Named Entity Recognition (NER) is a fundamental task in the area of Information Extraction (IE), in which mentions of Named Entities (NE) are extracted and classified in naturally occurring texts

  • Morphologically rich languages (MRLs) (Tsarfaty et al., 2010; Seddah et al., 2013; Tsarfaty et al., 2020) are languages in which substantial information concerning the arrangement of words into phrases and the relations between them is expressed at the word level, rather than in a fixed word order or a rigid structure

  • Explicit modeling of morphemes leads to better NER performance even when evaluated against token-level boundaries

Summary

Introduction

Named Entity Recognition (NER) is a fundamental task in the area of Information Extraction (IE), in which mentions of Named Entities (NEs) are extracted and classified in naturally occurring texts. This task is most commonly formulated as a sequence labeling task, where extraction takes the form of assigning each input token a label that marks the boundaries of the NE (e.g., B, I, O), and classification takes the form of assigning labels that indicate the entity type (PER, ORG, LOC, etc.). While NER in English is formulated as the sequence labeling of space-delimited tokens, in MRLs a single token may include multiple meaning-bearing units, morphemes, only some of which are relevant for the entity mention at hand.
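To make the labeling formulation concrete, the sketch below contrasts token-level and morpheme-level BIO labeling for a Hebrew phrase in which a prepositional prefix is fused onto the first word of an entity mention. The example sentence and the small span-collection helper are illustrative, not drawn from the paper's benchmark.

    # Illustrative sketch: token-level vs. morpheme-level BIO labels when an
    # entity boundary falls inside a space-delimited token. The example is
    # hypothetical, not taken from the paper's benchmark.

    def bio_to_spans(units, labels):
        """Collect (entity_type, surface) spans from a BIO-labeled sequence."""
        spans, current, ctype = [], [], None
        for unit, label in zip(units, labels):
            if label.startswith("B-"):
                if current:
                    spans.append((ctype, " ".join(current)))
                current, ctype = [unit], label[2:]
            elif label.startswith("I-") and current:
                current.append(unit)
            else:  # "O" (or a stray "I-") closes any open span
                if current:
                    spans.append((ctype, " ".join(current)))
                current, ctype = [], None
        if current:
            spans.append((ctype, " ".join(current)))
        return spans

    # "I flew to Tel Aviv": the preposition l- ("to") is fused onto "Tel".
    tokens = ["טסתי", "לתל", "אביב"]             # space-delimited tokens
    token_labels = ["O", "B-LOC", "I-LOC"]       # token-level: prefix forced into the entity

    morphemes = ["טסתי", "ל", "תל", "אביב"]      # after morphological decomposition
    morph_labels = ["O", "O", "B-LOC", "I-LOC"]  # morpheme-level: prefix excluded

    print(bio_to_spans(tokens, token_labels))     # [('LOC', 'לתל אביב')]
    print(bio_to_spans(morphemes, morph_labels))  # [('LOC', 'תל אביב')]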
