Abstract

Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.
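The abstract reports gains measured in span-based F1. As a point of reference only, the following is a minimal sketch of how span-level precision, recall and F1 are typically computed from gold and predicted (start, end, type) spans; the function name and data layout are illustrative and are not taken from the paper's codebase.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token index, end token index, entity type)

def span_f1(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Exact-match span-based precision, recall and F1.

    A predicted span counts as correct only if its boundaries and
    entity type both match a gold span exactly.
    """
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one type error and one boundary error lower both precision and recall.
gold = [(0, 1, "PER"), (4, 4, "LOC"), (7, 9, "ORG")]
pred = [(0, 1, "PER"), (4, 4, "MISC"), (7, 8, "ORG")]
print(span_f1(gold, pred))  # (0.333..., 0.333..., 0.333...)
```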

Highlights

  • We rely on Wikipedia text and its hypertext organization, but depart from previous work in our exploration of new language-independent techniques for silver data creation for Named Entity Recognition (NER) by providing a general approach based on an effective combination of knowledge-based techniques, such as form token matching and category-based rules, and trained language models to produce high-quality annotations for multilingual NER (a minimal sketch of this combination follows the list).

  • NER is widely used in many downstream tasks, like question answering (Mollá et al., 2006), machine translation (Babych and Hartley, 2003), information retrieval (Petkova and Croft, 2007), text summarization (Aone et al., 1998), text understanding (Zhang et al., 2019; Cheng and Erk, 2019) and entity linking (Tedeschi et al., 2021), among others.
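As an illustration of this kind of combination, the sketch below tags a tokenized sentence with a small gazetteer (exact form matching) and with a stand-in for a trained neural tagger, keeping only the spans on which the two agree. All names and data here (gazetteer_tag, neural_tag, the example gazetteer) are hypothetical and do not come from the WikiNEuRal pipeline; they only illustrate the general idea of intersecting knowledge-based and model-based annotations to obtain higher-precision silver data.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type), token indices inclusive

def gazetteer_tag(tokens: List[str],
                  gazetteer: Dict[Tuple[str, ...], str]) -> List[Span]:
    """Knowledge-based annotator: longest exact form matches against a gazetteer."""
    spans, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):  # prefer longer matches
            key = tuple(t.lower() for t in tokens[i:j])
            if key in gazetteer:
                match = (i, j - 1, gazetteer[key])
                break
        if match:
            spans.append(match)
            i = match[1] + 1
        else:
            i += 1
    return spans

def neural_tag(tokens: List[str]) -> List[Span]:
    """Stand-in for a trained NER model; in practice this would be a fine-tuned
    transformer tagger returning predicted spans for the same sentence."""
    return [(0, 1, "PER")]

def combine(tokens: List[str],
            gazetteer: Dict[Tuple[str, ...], str]) -> List[Span]:
    """Keep only spans on which both annotators agree (high-precision silver data)."""
    return sorted(set(gazetteer_tag(tokens, gazetteer)) & set(neural_tag(tokens)))

tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
gazetteer = {("barack", "obama"): "PER", ("hawaii",): "LOC"}
# The neural stub misses "Hawaii", so only the agreed PER span is kept.
print(combine(tokens, gazetteer))  # [(0, 1, 'PER')]
```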


Summary

WikiNEuRal

Nothman et al. (2013) introduced WikiNER, a pipeline that automatically creates multilingual training data for NER by exploiting the structure and the texts of Wikipedia. They classified each Wikipedia article into a named entity type, training and evaluating the classifier on manually labeled Wikipedia pages, and then converted Wikipedia links into labels according to the entity type (PER, ORG, LOC, MISC) assigned to the target article. Nothman et al. (2013) also showed that, when testing on manually annotated Wikipedia sentences, models trained on gold-standard newswire datasets perform poorly compared to models trained on automatically created Wikipedia corpora. Pan et al. (2017) proposed WikiANN, a language-independent framework that automatically extracts name mentions from documents by leveraging Wikipedia markup.
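The core idea shared by WikiNER and WikiANN, turning Wikipedia hyperlinks into NER labels once the linked articles have been classified into entity types, can be sketched as follows. The wikilink handling and the article_types mapping are simplified illustrations, not the actual implementation of either pipeline.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mapping from linked article titles to entity types,
# as produced by a document classification step (PER, ORG, LOC, MISC).
article_types: Dict[str, str] = {
    "Barack Obama": "PER",
    "United Nations": "ORG",
    "Hawaii": "LOC",
}

WIKILINK = re.compile(r"\[\[(?P<target>[^\]|]+)(?:\|(?P<surface>[^\]]+))?\]\]")

def sentence_to_bio(wikitext: str) -> List[Tuple[str, str]]:
    """Turn a wikitext sentence into (token, BIO tag) pairs.

    Linked mentions whose target article has a known entity type are
    labelled B-/I-TYPE; everything else is tagged O.
    """
    labeled: List[Tuple[str, str]] = []
    pos = 0
    for m in WIKILINK.finditer(wikitext):
        # Plain text before the link lies outside any entity.
        labeled += [(tok, "O") for tok in wikitext[pos:m.start()].split()]
        surface = m.group("surface") or m.group("target")
        etype = article_types.get(m.group("target"))
        for i, tok in enumerate(surface.split()):
            labeled.append((tok, "O" if etype is None else
                            ("B-" if i == 0 else "I-") + etype))
        pos = m.end()
    labeled += [(tok, "O") for tok in wikitext[pos:].split()]
    return labeled

sent = "[[Barack Obama]] spoke at the [[United Nations]] in [[New York City|New York]] ."
print(sentence_to_bio(sent))
# [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('spoke', 'O'), ('at', 'O'), ('the', 'O'),
#  ('United', 'B-ORG'), ('Nations', 'I-ORG'), ('in', 'O'), ('New', 'O'), ('York', 'O'), ('.', 'O')]
```

Note that the last mention is left untagged because "New York City" is not in the (illustrative) article-to-type mapping; in the real pipelines, coverage of such mentions is what the classification and link-inference steps are designed to improve.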

Preprocessing Wikipedia
Identifying Entity Mentions in Wikipedia
Tagging Named Entity Links Through Synsets
Improving Precision and Recall
Domain Adaptation
Domain embedding computation and domain extraction
Training Data
Test Data: We use five different test sets in our experiments.
Results
Multilingual Evaluation
A Reproducibility Details
B Additional Results
C OntoNotes-to-CoNLL Class Mapping