Abstract

Robustness to capitalization errors is a highly desirable property of named entity recognizers, yet we find that standard models for the task are surprisingly brittle to such noise. Existing methods for improving robustness to this noise completely discard the given orthographic information, which significantly degrades their performance on well-formed text. We propose a simple alternative approach based on data augmentation that allows the model to learn to use or ignore orthographic information depending on its usefulness in context. It achieves competitive robustness to capitalization errors while making a negligible compromise to performance on well-formed text and significantly improving generalization to noisy user-generated text. Our experiments clearly and consistently validate this claim across different types of machine learning models, languages, and dataset sizes.

Highlights

  • In the last two decades, substantial progress has been made on the task of named entity recognition (NER), as it has benefited from developments in probabilistic modeling (Lafferty et al., 2001; Finkel et al., 2005), methodology (Ratinov and Roth, 2009), deep learning (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016), and semi-supervised learning (Peters et al., 2017, 2018).

  • While standard training data for the task consists mainly of well-formed text (Tjong Kim Sang, 2002; Pradhan and Xue, 2009), models trained on such data are often applied to a broad range of domains and genres by users who are not necessarily NLP experts, thanks to the proliferation of toolkits (Manning et al., 2014) and general-purpose machine learning services.

  • Although text without correct capitalization is perfectly legible to human readers (Cattell, 1886; Rayner, 1975), with only a minor impact on reading speed (Tinker and Paterson, 1928; Arditi and Cho, 2007), we show that typical NER models are surprisingly brittle to all-uppercasing or all-lowercasing of text.


Summary

Introduction

In the last two decades, substantial progress has been made on the task of named entity recognition (NER), as it has benefited from developments in probabilistic modeling (Lafferty et al., 2001; Finkel et al., 2005), methodology (Ratinov and Roth, 2009), deep learning (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016), and semi-supervised learning (Peters et al., 2017, 2018). Evaluation of these developments has mostly focused on their impact on global average metrics, most notably the micro-averaged F1 score (Chinchor, 1992). We argue that an ideal approach should take full advantage of orthographic information when it is correctly present; rather than assuming that this information is always reliable, however, the model should be able to learn to ignore it when it is not. To this end, we propose a novel approach based on data augmentation (Simard et al., 2003). Across a wide range of models (from linear models and deep learning models to deep contextualized models), languages (English, German, Dutch, and Spanish), and dataset sizes (CoNLL 2003 and OntoNotes 5.0), the proposed method shows strong robustness while making little compromise to performance on well-formed text.
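The core idea of case-based data augmentation can be sketched as follows: for each training sentence, extra copies are added with the original capitalization destroyed, so the model sees both reliable and unreliable orthographic signals during training. This is a minimal illustrative sketch, not the paper's actual implementation; the function name and data layout are assumptions.

```python
def augment_capitalization(sentences, modes=("lower", "upper")):
    """Return the original sentences plus case-perturbed copies.

    `sentences` is a list of (tokens, labels) pairs. The labels are
    kept unchanged, since casing does not alter the gold entity spans.
    (Illustrative sketch; names and data layout are assumptions.)
    """
    augmented = list(sentences)  # keep the well-formed originals
    for tokens, labels in sentences:
        for mode in modes:
            if mode == "lower":
                perturbed = [t.lower() for t in tokens]
            else:  # "upper"
                perturbed = [t.upper() for t in tokens]
            augmented.append((perturbed, labels))
    return augmented


# One sentence yields the original plus a lowercased and an
# uppercased copy, all sharing the same label sequence.
data = [(["Barack", "Obama", "visited", "Paris"],
         ["B-PER", "I-PER", "O", "B-LOC"])]
aug = augment_capitalization(data)
```

Because the well-formed originals are retained alongside the perturbed copies, the model can still exploit capitalization when it is informative, which is what distinguishes this approach from methods that lowercase all input.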

Formulation
Prior Work
Data Augmentation
Method
Experiments
Conclusion

