Abstract

Robustness to capitalization errors is a highly desirable property of named entity recognizers, yet we find that standard models for the task are surprisingly brittle to such noise. Existing methods for improving robustness to this noise completely discard the given orthographic information, which significantly degrades their performance on well-formed text. We propose a simple alternative approach based on data augmentation that allows the model to learn to use or ignore orthographic information depending on its usefulness in context. It achieves competitive robustness to capitalization errors while making a negligible compromise to performance on well-formed text and significantly improving generalization to noisy user-generated text. Our experiments clearly and consistently validate this claim across different types of machine learning models, languages, and dataset sizes.

Highlights

  • In the last two decades, substantial progress has been made on the task of named entity recognition (NER), as it has benefited from developments in probabilistic modeling (Lafferty et al., 2001; Finkel et al., 2005), methodology (Ratinov and Roth, 2009), deep learning (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016), and semi-supervised learning (Peters et al., 2017, 2018).

  • While standard training data for the task consists mainly of well-formed text (Tjong Kim Sang, 2002; Pradhan and Xue, 2009), models trained on such data are often applied to a broad range of domains and genres by users who are not necessarily NLP experts, thanks to the proliferation of toolkits (Manning et al., 2014) and general-purpose machine learning services.

  • Although text without correct capitalization is perfectly legible to human readers (Cattell, 1886; Rayner, 1975), with only a minor impact on reading speed (Tinker and Paterson, 1928; Arditi and Cho, 2007), we show that typical NER models are surprisingly brittle to all-uppercasing or all-lowercasing of text.


Summary

Introduction

In the last two decades, substantial progress has been made on the task of named entity recognition (NER), as it has benefited from developments in probabilistic modeling (Lafferty et al., 2001; Finkel et al., 2005), methodology (Ratinov and Roth, 2009), deep learning (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016), and semi-supervised learning (Peters et al., 2017, 2018). Evaluation of these developments has mostly focused on their impact on global average metrics, most notably the micro-averaged F1 score (Chinchor, 1992). We argue that an ideal approach should take full advantage of orthographic information when it is correctly present; rather than assuming that this information is always reliable, however, the model should be able to learn to ignore it when it is not. To this end, we propose a novel approach based on data augmentation (Simard et al., 2003). Across a wide range of models (from linear models and deep learning models to deep contextualized models), languages (English, German, Dutch, and Spanish), and dataset sizes (CoNLL 2003 and OntoNotes 5.0), the proposed method shows strong robustness while making little compromise to performance on well-formed text.
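The core idea of case-based data augmentation can be sketched as follows: for each training sentence, extra copies are added with the original capitalization destroyed, so the model sees both reliable and unreliable orthographic signals during training. This is a minimal illustrative sketch, not the paper's actual implementation; the function name and data layout are assumptions.

```python
def augment_capitalization(sentences, modes=("lower", "upper")):
    """Return the original sentences plus case-perturbed copies.

    `sentences` is a list of (tokens, labels) pairs. The labels are
    kept unchanged, since casing does not alter the gold entity spans.
    (Illustrative sketch; names and data layout are assumptions.)
    """
    augmented = list(sentences)  # keep the well-formed originals
    for tokens, labels in sentences:
        for mode in modes:
            if mode == "lower":
                perturbed = [t.lower() for t in tokens]
            else:  # "upper"
                perturbed = [t.upper() for t in tokens]
            augmented.append((perturbed, labels))
    return augmented


# One sentence yields the original plus a lowercased and an
# uppercased copy, all sharing the same label sequence.
data = [(["Barack", "Obama", "visited", "Paris"],
         ["B-PER", "I-PER", "O", "B-LOC"])]
aug = augment_capitalization(data)
```

Because the well-formed originals are retained alongside the perturbed copies, the model can still exploit capitalization when it is informative, which is what distinguishes this approach from methods that lowercase all input.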

Formulation
Prior Work
Data Augmentation
Method
Experiments
Conclusion

