Abstract

Word embedding is an essential building block for deep learning methods for natural language processing. Although word embedding has been extensively studied over the years, the problem of how to effectively embed numerals, a special subset of words, is still underexplored. Existing word embedding methods do not learn numeral embeddings well because there are an infinite number of numerals and their individual appearances in training corpora are highly scarce. In this paper, we propose two novel numeral embedding methods that can handle the out-of-vocabulary (OOV) problem for numerals. We first induce a finite set of prototype numerals using either a self-organizing map or a Gaussian mixture model. We then represent the embedding of a numeral as a weighted average of the prototype numeral embeddings. Numeral embeddings represented in this manner can be plugged into existing word embedding learning approaches such as skip-gram for training. We evaluated our methods and showed their effectiveness on four intrinsic and extrinsic tasks: word similarity, embedding numeracy, numeral prediction, and sequence labeling.
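As a concrete illustration of the idea, the minimal sketch below induces prototype numerals with a one-dimensional Gaussian mixture and embeds any numeral as a posterior-weighted average of trainable prototype vectors. The use of scikit-learn's GaussianMixture, the signed-log squashing of numeral values, the prototype count, and the random initialization are illustrative assumptions, not the paper's exact configuration (the paper also supports a self-organizing map and other weighting schemes).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch: prototype induction with a 1-D GMM over corpus numerals,
# then prototype-weighted numeral embeddings. Hyperparameters are assumptions.

rng = np.random.default_rng(0)

# Numerals collected from a training corpus (toy sample).
corpus_numerals = np.array([1, 2, 3, 10, 12, 100, 250, 1999, 2003, 1_000_000.0])

# Work in a signed-log space so that huge magnitudes do not dominate.
def squash(x):
    return np.sign(x) * np.log1p(np.abs(x))

X = squash(corpus_numerals).reshape(-1, 1)

# 1) Induce a finite set of prototype numerals (GMM variant; a
#    self-organizing map could be used instead).
n_prototypes, dim = 4, 8
gmm = GaussianMixture(n_components=n_prototypes, random_state=0).fit(X)

# One trainable embedding vector per prototype (randomly initialized here;
# in practice these are learned jointly with the word embedding objective).
prototype_vecs = rng.normal(scale=0.1, size=(n_prototypes, dim))

# 2) Embed any numeral, seen or unseen, as a weighted average of the
#    prototype embeddings, with weights given by the GMM posteriors.
def embed_numeral(value):
    weights = gmm.predict_proba(squash(np.array([[value]])))[0]  # (n_prototypes,)
    return weights @ prototype_vecs                              # (dim,)

print(embed_numeral(11))       # lands near the embeddings of 10 and 12
print(embed_numeral(3.14159))  # an OOV numeral still gets a sensible vector
```

Because every numeral, seen or unseen, maps onto the same small set of prototypes, the OOV problem for numerals disappears and numerals of similar magnitude receive similar vectors.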

Highlights

  • Word embeddings have become an essential building block for deep learning approaches to natural language processing (NLP)

  • Our work differs from previous work in that we aim to produce general-purpose numeral embeddings that can be employed in any neural NLP approach

  • We describe how to integrate our numeral embeddings into traditional word embedding methods for training (see the sketch below this list)

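The following sketch shows one way such prototype-based numeral embeddings could be plugged into skip-gram with negative sampling: when the center token is a numeral, its input vector is the prototype-weighted average, so gradients flow into the prototype embeddings during training. The module name, the precomputed proto_weights argument, and the loss shown here are illustrative PyTorch assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NumeralAwareSkipGram(nn.Module):
    """Skip-gram with negative sampling in which numeral tokens are embedded
    as a weighted average of prototype vectors (illustrative sketch only)."""

    def __init__(self, vocab_size, n_prototypes, dim):
        super().__init__()
        self.word_in = nn.Embedding(vocab_size, dim)   # input vectors for ordinary words
        self.word_out = nn.Embedding(vocab_size, dim)  # output (context) vectors
        self.proto_in = nn.Parameter(0.1 * torch.randn(n_prototypes, dim))  # prototype embeddings

    def center_vec(self, word_ids, proto_weights, is_numeral):
        # proto_weights: (B, P) weights from the SOM/GMM prototype model;
        # values at non-numeral positions are ignored via the mask below.
        word_vecs = self.word_in(word_ids)               # (B, D)
        numeral_vecs = proto_weights @ self.proto_in     # (B, D)
        return torch.where(is_numeral.unsqueeze(-1), numeral_vecs, word_vecs)

    def forward(self, word_ids, proto_weights, is_numeral, ctx_ids, neg_ids):
        v = self.center_vec(word_ids, proto_weights, is_numeral)              # (B, D)
        pos = (v * self.word_out(ctx_ids)).sum(-1)                            # (B,)
        neg = torch.bmm(self.word_out(neg_ids), v.unsqueeze(-1)).squeeze(-1)  # (B, K)
        return (-F.logsigmoid(pos) - F.logsigmoid(-neg).sum(-1)).mean()
```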

Summary

Introduction

Word embeddings have become an essential building block for deep learning approaches to natural language processing (NLP). The quality of pretrained word embeddings has been shown to significantly impact the performance of neural approaches to a variety of NLP tasks. Over the past two decades, significant progress has been made in the development of word embedding techniques (Lund and Burgess, 1996; Bengio et al., 2003; Bullinaria and Levy, 2007; Mikolov et al., 2013b; Pennington et al., 2014). However, existing word embedding methods do not handle numerals adequately and cannot directly encode the numeracy and magnitude of a numeral (Naik et al., 2019). Most numerals appear only rarely in training corpora and are more likely to be out-of-vocabulary (OOV) than non-numerical words. Numerals account for 6.15% of all unique tokens in English Wikipedia, but in GloVe (Pennington et al., 2014), which is partially trained on Wikipedia, only 3.79% of the vocabulary consists of numerals.

