Abstract

Letter case information can impact named entity recognition (NER) by affecting the way entities are represented in text. For instance, if proper nouns are capitalized, NER models can exploit this signal to identify named entities. Although previous studies have analyzed the performance drop of NER models when capitalization information is removed, it remains unclear how different letter-case scenarios affect NER performance across different types of data and domains. In this study, we examine the impact of different letter-case features on NER and investigate their effectiveness in improving NER system performance and robustness. The analysis of the effect of different letter-case scenarios on NER performance is performed both within a single domain and across multiple domains. The experimental results demonstrate that capitalization errors significantly affect NER performance in both in-domain and cross-domain evaluation. The case-insensitive (BERT-base-uncased) model is more robust to inconsistencies in capitalization that may occur in noisy text data, whereas the case-sensitive (BERT-base-cased) model performs better on well-written text that provides clear case information. However, when a case-sensitive model is required by the application, we propose a simple data augmentation heuristic based on letter case that improves the model's robustness against capitalization errors commonly observed in user-generated text. Overall, our findings suggest that changing the style of the source domain to match that of the target domain can lead to better domain adaptation for NER, and that the choice of BERT model should consider the nature of the text data being analyzed. Our code and data for reproducing this work are available at https://github.com/daotuanan/Letter-case-NER.
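The abstract does not spell out the augmentation heuristic itself, but the general idea of case-based data augmentation can be sketched as follows. This is a minimal, hypothetical illustration (function name, probability parameter, and the specific perturbations are assumptions, not the paper's exact method): each training sentence is randomly perturbed, e.g. fully lowercased, to simulate the missing-capitalization errors common in user-generated text, so a case-sensitive model sees both clean and corrupted case patterns during training.

```python
import random

def augment_case(tokens, p_lower=0.5):
    """Hypothetical case-based augmentation for NER training data.

    With probability p_lower, the whole sentence is lowercased to
    mimic user-generated text that drops capitalization; otherwise
    the original tokens are returned unchanged. Token-level NER
    labels are unaffected, so they can be reused as-is.
    """
    if random.random() < p_lower:
        return [t.lower() for t in tokens]
    return tokens

# Example: mixing original and lowercased copies of the training set
sentence = ["Barack", "Obama", "visited", "Paris", "."]
augmented = augment_case(sentence, p_lower=1.0)  # forced perturbation
```

Variants of this idea (uppercasing, corrupting only some tokens, or duplicating each sentence in both cased and uncased form) follow the same pattern; the key point is that the augmentation changes only surface case, never the entity labels.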
