Abstract

Text in many domains contains a significant number of named entities. Predicting entity names is often challenging for a language model because they appear less frequently in the training corpus. In this paper, we propose a novel and effective approach to building a language model that can learn entity names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java programming code, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity on recipe generation and 22.06% better perplexity on code generation than state-of-the-art language models.
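
As a purely illustrative aid, the Python (PyTorch) sketch below shows one way entity type information could be folded into a language model: next-token prediction is factorized into predicting an entity type and then predicting the word conditioned on that type. The module name, dimensions, and this particular two-step factorization are assumptions made for the sketch and are not taken from the paper.

    import torch
    import torch.nn as nn

    class TypeAwareLM(nn.Module):
        """Illustrative type-aware LM:
        p(word, type | context) = p(type | context) * p(word | context, type)."""

        def __init__(self, vocab_size, num_types, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, emb_dim)
            self.type_emb = nn.Embedding(num_types, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.type_head = nn.Linear(hidden_dim, num_types)             # scores for p(type | context)
            self.word_head = nn.Linear(hidden_dim + emb_dim, vocab_size)  # scores for p(word | context, type)

        def forward(self, tokens, next_types):
            # tokens, next_types: (batch, seq_len) integer ids; next_types holds the entity
            # type of the token to be predicted (a generic "no-entity" type for plain words).
            h, _ = self.rnn(self.word_emb(tokens))
            type_logits = self.type_head(h)
            word_logits = self.word_head(torch.cat([h, self.type_emb(next_types)], dim=-1))
            return type_logits, word_logits

Under these assumptions, training would sum two cross-entropy losses, one over type_logits against the gold entity types and one over word_logits against the gold next words; at generation time the model would first pick a type and then a word conditioned on that type.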

Highlights

  • Language models are a fundamental component of Natural Language Processing (NLP), supporting various applications including document generation (Wiseman et al., 2017), text autocompletion (Arnold et al., 2017), spelling correction (Brill and Moore, 2000), and many others

  • Our experiments show that while state-of-the-art language models are, in general, good at learning frequent words with enough training instances, they perform poorly on entity names

  • We evaluate our proposed model on two different language generation tasks whose text contains many entity names; a sketch of the perplexity metric used for this comparison follows this list
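
The comparison against state-of-the-art language models is reported in terms of perplexity, i.e., the exponential of the average negative log-likelihood the model assigns to held-out tokens. The function below is a minimal illustration of that standard definition, not code from the paper; the variable names are hypothetical.

    import math

    def perplexity(token_log_probs):
        """token_log_probs: natural-log probabilities log p(w_t | w_<t)
        assigned by a model to each held-out token."""
        avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_neg_log_likelihood)

    # Example: a model that assigns probability 0.1 to every token has perplexity 10.
    print(perplexity([math.log(0.1)] * 1000))  # prints 10.0 (up to floating-point error)

Lower perplexity is better, so the improvements quoted in the abstract correspond to reductions in this quantity.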


Summary

Introduction

Language models are a fundamental component of Natural Language Processing (NLP), supporting various applications including document generation (Wiseman et al., 2017), text autocompletion (Arnold et al., 2017), spelling correction (Brill and Moore, 2000), and many others; they have also been applied to source code (Hindle et al., 2016; Yin and Neubig, 2017; Hellendoorn and Devanbu, 2017; Rabinovich et al., 2017). These models have improved language generation tasks to a great extent (e.g., Mikolov et al., 2010; Galley et al., 2015). However, they struggle with entity names: in recipes, for instance, there are numerous similar yet slightly different cooking ingredients (e.g., olive oil, canola oil, and grape oil, which are all different varieties of oil). Such a diverse vocabulary of ingredient names hinders a language model from predicting them properly.

