Abstract

In some languages, Named Entity Recognition (NER) is severely hindered by complex linguistic structures, such as inflection, which can confuse data-driven models about a word's actual meaning. This work aims to alleviate these problems by introducing a novel neural network based on morphological and syntactic grammars. The experiments were performed on four Nordic languages, which have many grammar rules. The model is named the NorG network (Nor: Nordic Languages, G: Grammar). In addition to learning from the text content, the NorG network also learns from the word writing form, the POS tag, and the dependency. The proposed network uses a bidirectional Long Short-Term Memory (Bi-LSTM) layer to capture word-level grammars and a bidirectional Graph Attention (Bi-GAT) layer to capture sentence-level grammars. Experimental results on the four languages show that the grammar-assisted network significantly improves over the baselines. We also investigate how the NorG network uses each grammar component through exploratory experiments.

Highlights

  • Machine learning models have been widely applied in Natural Language Processing (NLP) techniques, which replace the previous rule-based models and show better performance

  • Most leading Named Entity Recognition (NER) models are based on BERT [9], a type of word embedding pretrained by the Transformer architecture [25]

  • Our model was evaluated on the NorNE (Norwegian Bokmål), NorNE (Norwegian Nynorsk), DaNE (Danish), and Turku NER (Finnish) datasets, whose linguistic structures are annotated in CoNLL-U format
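
The CoNLL-U annotations mentioned above carry the grammar signals the abstract lists (word form, POS tag, dependency). A minimal sketch of reading them, assuming the standard ten tab-separated CoNLL-U columns: the `Token` type, both function names, and the sample sentence are hypothetical illustrations, not the paper's preprocessing code, and returning the dependency edges in both directions is only an assumption about what a bidirectional graph layer might consume.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int      # 1-based token index (CoNLL-U column ID)
    form: str     # surface word form (FORM)
    upos: str     # universal POS tag (UPOS)
    head: int     # index of the syntactic head, 0 for the root (HEAD)
    deprel: str   # dependency relation label (DEPREL)


def parse_conllu_sentence(block: str) -> List[Token]:
    """Parse one CoNLL-U sentence block into a list of tokens."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens


def dependency_edges(tokens: List[Token]) -> Tuple[list, list]:
    """Return (head -> dependent, dependent -> head) edge lists; the root is skipped."""
    forward = [(t.head, t.idx) for t in tokens if t.head != 0]
    backward = [(dep, head) for head, dep in forward]
    return forward, backward


# Illustrative sentence only; it is not taken from any of the datasets.
ROWS = [
    ["1", "Ola", "Ola", "PROPN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "bor", "bo", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "i", "i", "ADP", "_", "_", "4", "case", "_", "_"],
    ["4", "Oslo", "Oslo", "PROPN", "_", "_", "2", "obl", "_", "_"],
]
SAMPLE = "# text = Ola bor i Oslo\n" + "\n".join("\t".join(r) for r in ROWS)

tokens = parse_conllu_sentence(SAMPLE)
forward, backward = dependency_edges(tokens)
```

Here `tokens` holds the per-word features and `forward`/`backward` hold the sentence-level dependency structure in each direction.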


Summary

Introduction

Machine learning models have been widely applied in Natural Language Processing (NLP) techniques, which replace the previous rule-based models and show better performance. Named Entity Recognition (NER) is a type of NLP technique based on machine learning models that extracts entities from sentences [2]. NER has seen considerable development in English, and many data-driven models have been proposed. Compared with English, some languages have more complex linguistic structures. To address these grammar rules, this work proposes a grammar-based network for named entity recognition and evaluates it on four Nordic languages. Experimental results demonstrate the effectiveness of the proposed method in the four languages, and exploratory experiments were conducted to examine how different grammar components influence NER performance.

Related Works
Materials and Methods
NorG Embedding
Bi-LSTM Layer
Bi-GAT Layer
CRF Layer
NER Datasets
Norwegian Bokmål and Nynorsk
Danish
Finnish
Baselines
Hyperparameters of the NorG Network
Results
Main Results
Ablation Experiments
Performance against Sentence Length
Performance on Automatically Obtained Grammars
