Abstract

Word embedding is an essential building block for deep learning methods for natural language processing. Although word embedding has been extensively studied over the years, the problem of how to effectively embed numerals, a special subset of words, is still underexplored. Existing word embedding methods do not learn numeral embeddings well because there are an infinite number of numerals and their individual appearances in training corpora are highly scarce. In this paper, we propose two novel numeral embedding methods that can handle the out-of-vocabulary (OOV) problem for numerals. We first induce a finite set of prototype numerals using either a self-organizing map or a Gaussian mixture model. We then represent the embedding of a numeral as a weighted average of the prototype numeral embeddings. Numeral embeddings represented in this manner can be plugged into existing word embedding learning approaches such as skip-gram for training. We evaluated our methods and showed their effectiveness on four intrinsic and extrinsic tasks: word similarity, embedding numeracy, numeral prediction, and sequence labeling.
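As a concrete illustration of the idea, the minimal sketch below induces prototype numerals with a one-dimensional Gaussian mixture and embeds any numeral as a posterior-weighted average of trainable prototype vectors. The use of scikit-learn's GaussianMixture, the signed-log squashing of numeral values, the prototype count, and the random initialization are illustrative assumptions, not the paper's exact configuration (the paper also supports a self-organizing map and other weighting schemes).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch: prototype induction with a 1-D GMM over corpus numerals,
# then prototype-weighted numeral embeddings. Hyperparameters are assumptions.

rng = np.random.default_rng(0)

# Numerals collected from a training corpus (toy sample).
corpus_numerals = np.array([1, 2, 3, 10, 12, 100, 250, 1999, 2003, 1_000_000.0])

# Work in a signed-log space so that huge magnitudes do not dominate.
def squash(x):
    return np.sign(x) * np.log1p(np.abs(x))

X = squash(corpus_numerals).reshape(-1, 1)

# 1) Induce a finite set of prototype numerals (GMM variant; a
#    self-organizing map could be used instead).
n_prototypes, dim = 4, 8
gmm = GaussianMixture(n_components=n_prototypes, random_state=0).fit(X)

# One trainable embedding vector per prototype (randomly initialized here;
# in practice these are learned jointly with the word embedding objective).
prototype_vecs = rng.normal(scale=0.1, size=(n_prototypes, dim))

# 2) Embed any numeral, seen or unseen, as a weighted average of the
#    prototype embeddings, with weights given by the GMM posteriors.
def embed_numeral(value):
    weights = gmm.predict_proba(squash(np.array([[value]])))[0]  # (n_prototypes,)
    return weights @ prototype_vecs                              # (dim,)

print(embed_numeral(11))       # lands near the embeddings of 10 and 12
print(embed_numeral(3.14159))  # an OOV numeral still gets a sensible vector
```

Because every numeral, seen or unseen, maps onto the same small set of prototypes, the OOV problem for numerals disappears and numerals of similar magnitude receive similar vectors.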

Highlights

  • Word embeddings have become an essential building block for deep learning approaches to natural language processing (NLP)

  • Our work differs from previous work in that we aim to produce general-purpose numeral embeddings that can be employed in any neural NLP approach

  • We describe how to integrate our numeral embeddings into traditional word embedding methods for training (see the sketch below this list)

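The following sketch shows one way such prototype-based numeral embeddings could be plugged into skip-gram with negative sampling: when the center token is a numeral, its input vector is the prototype-weighted average, so gradients flow into the prototype embeddings during training. The module name, the precomputed proto_weights argument, and the loss shown here are illustrative PyTorch assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NumeralAwareSkipGram(nn.Module):
    """Skip-gram with negative sampling in which numeral tokens are embedded
    as a weighted average of prototype vectors (illustrative sketch only)."""

    def __init__(self, vocab_size, n_prototypes, dim):
        super().__init__()
        self.word_in = nn.Embedding(vocab_size, dim)   # input vectors for ordinary words
        self.word_out = nn.Embedding(vocab_size, dim)  # output (context) vectors
        self.proto_in = nn.Parameter(0.1 * torch.randn(n_prototypes, dim))  # prototype embeddings

    def center_vec(self, word_ids, proto_weights, is_numeral):
        # proto_weights: (B, P) weights from the SOM/GMM prototype model;
        # values at non-numeral positions are ignored via the mask below.
        word_vecs = self.word_in(word_ids)               # (B, D)
        numeral_vecs = proto_weights @ self.proto_in     # (B, D)
        return torch.where(is_numeral.unsqueeze(-1), numeral_vecs, word_vecs)

    def forward(self, word_ids, proto_weights, is_numeral, ctx_ids, neg_ids):
        v = self.center_vec(word_ids, proto_weights, is_numeral)              # (B, D)
        pos = (v * self.word_out(ctx_ids)).sum(-1)                            # (B,)
        neg = torch.bmm(self.word_out(neg_ids), v.unsqueeze(-1)).squeeze(-1)  # (B, K)
        return (-F.logsigmoid(pos) - F.logsigmoid(-neg).sum(-1)).mean()
```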

Summary

Introduction

Word embeddings have become an essential building block for deep learning approaches to natural language processing (NLP). The quality of pretrained word embeddings has been shown to significantly impact the performance of neural approaches to a variety of NLP tasks. Over the past two decades, significant progress has been made in the development of word embedding techniques (Lund and Burgess, 1996; Bengio et al., 2003; Bullinaria and Levy, 2007; Mikolov et al., 2013b; Pennington et al., 2014). However, existing word embedding methods do not handle numerals adequately and cannot directly encode the numeracy and magnitude of a numeral (Naik et al., 2019). Most numerals appear only rarely in training corpora and are more likely to be out-of-vocabulary (OOV) than non-numerical words. Numerals account for 6.15% of all unique tokens in English Wikipedia, but in GloVe (Pennington et al., 2014), which is partially trained on Wikipedia, only 3.79% of the vocabulary consists of numerals.

