Abstract

Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.
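The abstract reports gains measured in span-based F1. As a point of reference only, the following is a minimal sketch of how span-level precision, recall and F1 are typically computed from gold and predicted (start, end, type) spans; the function name and data layout are illustrative and are not taken from the paper's codebase.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token index, end token index, entity type)

def span_f1(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Exact-match span-based precision, recall and F1.

    A predicted span counts as correct only if its boundaries and
    entity type both match a gold span exactly.
    """
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one type error and one boundary error lower both precision and recall.
gold = [(0, 1, "PER"), (4, 4, "LOC"), (7, 9, "ORG")]
pred = [(0, 1, "PER"), (4, 4, "MISC"), (7, 8, "ORG")]
print(span_f1(gold, pred))  # (0.333..., 0.333..., 0.333...)
```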

Highlights

  • We rely on Wikipedia text and its hypertext organization, but depart from previous work in our exploration of new language-independent techniques for silver data creation for Named Entity Recognition (NER) by providing a general approach based on an effective combination of knowledge-based techniques, such as form token matching and category-based rules, and trained language models to produce high-quality annotations for multilingual NER (a minimal sketch of this combination follows the list).

  • NER is widely used in many downstream tasks, like question answering (Mollá et al., 2006), machine translation (Babych and Hartley, 2003), information retrieval (Petkova and Croft, 2007), text summarization (Aone et al., 1998), text understanding (Zhang et al., 2019; Cheng and Erk, 2019) and entity linking (Tedeschi et al., 2021), among others.
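As an illustration of this kind of combination, the sketch below tags a tokenized sentence with a small gazetteer (exact form matching) and with a stand-in for a trained neural tagger, keeping only the spans on which the two agree. All names and data here (gazetteer_tag, neural_tag, the example gazetteer) are hypothetical and do not come from the WikiNEuRal pipeline; they only illustrate the general idea of intersecting knowledge-based and model-based annotations to obtain higher-precision silver data.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type), token indices inclusive

def gazetteer_tag(tokens: List[str],
                  gazetteer: Dict[Tuple[str, ...], str]) -> List[Span]:
    """Knowledge-based annotator: longest exact form matches against a gazetteer."""
    spans, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):  # prefer longer matches
            key = tuple(t.lower() for t in tokens[i:j])
            if key in gazetteer:
                match = (i, j - 1, gazetteer[key])
                break
        if match:
            spans.append(match)
            i = match[1] + 1
        else:
            i += 1
    return spans

def neural_tag(tokens: List[str]) -> List[Span]:
    """Stand-in for a trained NER model; in practice this would be a fine-tuned
    transformer tagger returning predicted spans for the same sentence."""
    return [(0, 1, "PER")]

def combine(tokens: List[str],
            gazetteer: Dict[Tuple[str, ...], str]) -> List[Span]:
    """Keep only spans on which both annotators agree (high-precision silver data)."""
    return sorted(set(gazetteer_tag(tokens, gazetteer)) & set(neural_tag(tokens)))

tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
gazetteer = {("barack", "obama"): "PER", ("hawaii",): "LOC"}
# The neural stub misses "Hawaii", so only the agreed PER span is kept.
print(combine(tokens, gazetteer))  # [(0, 1, 'PER')]
```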


Summary

WikiNEuRal

Nothman et al. (2013) introduced WikiNER, a pipeline that automatically creates multilingual training data for NER by exploiting the structure and the texts of Wikipedia. They classified each Wikipedia article into a named entity type, training and evaluating the classifier on manually labeled Wikipedia pages, and then converted Wikipedia links into labels according to the entity type (PER, ORG, LOC, MISC) assigned to the target article. Nothman et al. (2013) also showed that, when testing on manually annotated Wikipedia sentences, models trained on gold-standard newswire datasets perform poorly compared to models trained on automatically created Wikipedia corpora. Pan et al. (2017) proposed WikiANN, a language-independent framework that automatically extracts name mentions from documents by leveraging Wikipedia markup.
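The core idea shared by WikiNER and WikiANN, turning Wikipedia hyperlinks into NER labels once the linked articles have been classified into entity types, can be sketched as follows. The wikilink handling and the article_types mapping are simplified illustrations, not the actual implementation of either pipeline.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mapping from linked article titles to entity types,
# as produced by a document classification step (PER, ORG, LOC, MISC).
article_types: Dict[str, str] = {
    "Barack Obama": "PER",
    "United Nations": "ORG",
    "Hawaii": "LOC",
}

WIKILINK = re.compile(r"\[\[(?P<target>[^\]|]+)(?:\|(?P<surface>[^\]]+))?\]\]")

def sentence_to_bio(wikitext: str) -> List[Tuple[str, str]]:
    """Turn a wikitext sentence into (token, BIO tag) pairs.

    Linked mentions whose target article has a known entity type are
    labelled B-/I-TYPE; everything else is tagged O.
    """
    labeled: List[Tuple[str, str]] = []
    pos = 0
    for m in WIKILINK.finditer(wikitext):
        # Plain text before the link lies outside any entity.
        labeled += [(tok, "O") for tok in wikitext[pos:m.start()].split()]
        surface = m.group("surface") or m.group("target")
        etype = article_types.get(m.group("target"))
        for i, tok in enumerate(surface.split()):
            labeled.append((tok, "O" if etype is None else
                            ("B-" if i == 0 else "I-") + etype))
        pos = m.end()
    labeled += [(tok, "O") for tok in wikitext[pos:].split()]
    return labeled

sent = "[[Barack Obama]] spoke at the [[United Nations]] in [[New York City|New York]] ."
print(sentence_to_bio(sent))
# [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('spoke', 'O'), ('at', 'O'), ('the', 'O'),
#  ('United', 'B-ORG'), ('Nations', 'I-ORG'), ('in', 'O'), ('New', 'O'), ('York', 'O'), ('.', 'O')]
```

Note that the last mention is left untagged because "New York City" is not in the (illustrative) article-to-type mapping; in the real pipelines, coverage of such mentions is what the classification and link-inference steps are designed to improve.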

Preprocessing Wikipedia
Identifying Entity Mentions in Wikipedia
Tagging Named Entity Links Through Synsets
Improving Precision and Recall
Domain Adaptation
Domain embedding computation and domain extraction
Training Data
Test Data: We use five different test sets in our experiments.
Results
Multilingual Evaluation
A Reproducibility Details
B Additional Results
C OntoNotes-to-CoNLL Class Mapping