Abstract

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-informed models should be particularly effective for morphologically rich languages (MRLs), which exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community that take subword-level information into account. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into neural language model training, to facilitate word-level prediction. We conduct experiments in an LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically rich languages. Our code and data sets are publicly available.
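
To make the type-to-token ratio concrete, the following minimal Python sketch computes it for a tokenised corpus. The toy corpus and whitespace tokenisation are purely illustrative and are not taken from the paper's experimental setup.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Ratio of distinct word types to total word tokens.

    Morphologically rich languages tend to have higher ratios,
    since each lemma surfaces in many inflected forms.
    """
    counts = Counter(tokens)
    return len(counts) / len(tokens)

# Illustrative example: whitespace tokenisation of a toy corpus.
corpus = "the cat sat on the mat while the dog slept".split()
print(type_token_ratio(corpus))  # 8 types / 10 tokens = 0.8
```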

Highlights

  • Chinese, Japanese, and Thai are sourced from Wikipedia and processed with the Polyglot tokeniser, since we found that their preprocessing in the Polyglot Wikipedia (PW) was not adequate for language modeling

  • We present a method for fine-tuning the output matrix M_w within the char-CNN-LSTM (character-level convolutional neural network combined with long short-term memory) LM framework; see the sketch after this list

  • We present a comprehensive language modeling study over a set of 50 typologically diverse languages
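
As a rough illustration of the architecture the second highlight refers to, the following PyTorch sketch shows a char-CNN-LSTM language model whose word-level output matrix M_w can be fine-tuned in isolation. All module names, dimensions, and the freezing strategy are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CharCNNLSTMLM(nn.Module):
    """Skeleton of a char-CNN-LSTM language model.

    Each input word is encoded from its character sequence by a
    convolutional layer; prediction is still made over a word-level
    output matrix M_w (here ``self.output_embed``), which can be
    fine-tuned separately from the rest of the model.
    """

    def __init__(self, n_chars, n_words, char_dim=16, n_filters=128,
                 kernel_size=3, hidden_dim=256):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)
        self.lstm = nn.LSTM(n_filters, hidden_dim, batch_first=True)
        # Word-level output matrix M_w: one row of weights per vocabulary word.
        self.output_embed = nn.Linear(hidden_dim, n_words, bias=False)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len) character indices.
        b, t, l = char_ids.shape
        chars = self.char_embed(char_ids.reshape(b * t, l))   # (b*t, l, char_dim)
        chars = chars.transpose(1, 2)                         # (b*t, char_dim, l)
        # Max-pool convolutional features over character positions.
        word_vecs = torch.relu(self.char_conv(chars)).max(dim=2).values
        hidden, _ = self.lstm(word_vecs.reshape(b, t, -1))    # (b, t, hidden_dim)
        return self.output_embed(hidden)                      # (b, t, n_words) logits

# Fine-tuning only the output matrix M_w: freeze every other parameter.
model = CharCNNLSTMLM(n_chars=100, n_words=10000)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("output_embed")
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)

# Sanity check on random character ids: batch of 2 sentences, 12 words, 10 chars.
logits = model(torch.randint(1, 100, (2, 12, 10)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```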


Summary

Introduction

A traditional recurrent neural network (RNN) LM setup operates on a limited, closed vocabulary of words (Bengio et al., 2003; Mikolov et al., 2010). The limitation arises because the model learns parameters exclusive to single words. A standard training procedure for neural LMs gradually modifies these parameters based on contextual/distributional information: each occurrence of a word token in the training data contributes to the estimate of the word vector (i.e., the model parameters) assigned to that word type. Low-frequency words therefore often have inaccurate estimates, not having moved far from their random initialisation. A common strategy for dealing with this issue is to exclude the low-quality parameters from the model (i.e., to replace the corresponding words with a single placeholder token), so that only a subset of the vocabulary is represented by the model.
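
As a concrete example of this closed-vocabulary strategy, the following Python sketch collapses all words below a frequency cutoff into a single placeholder before training. The cutoff value and the `<unk>` symbol are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter

def build_closed_vocabulary(tokens, min_count=5, unk="<unk>"):
    """Keep words seen at least ``min_count`` times; map the rest to ``unk``.

    This is the standard closed-vocabulary preprocessing step: rare words,
    whose embeddings would otherwise stay close to their random
    initialisation, are collapsed into a single placeholder type.
    """
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add(unk)
    mapped = [w if w in vocab else unk for w in tokens]
    return vocab, mapped

# Illustrative usage on a toy corpus with a low cutoff.
tokens = "a a a a b b b c".split()
vocab, mapped = build_closed_vocabulary(tokens, min_count=2)
print(sorted(vocab))  # ['<unk>', 'a', 'b']
print(mapped)         # the rare word 'c' has been replaced with '<unk>'
```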
