Abstract
Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling. Such subword-informed models should be particularly effective for morphologically rich languages (MRLs), which exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into neural language model training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for MRLs. Our code and data sets are publicly available.
Highlights
Chinese, Japanese, and Thai are sourced from Wikipedia and processed with the Polyglot tokeniser, since we found that their preprocessing in the Polyglot Wikipedia (PW) corpus is not adequate for language modeling
We present a method for fine-tuning the output matrix M_w within the Char-CNN-LSTM (character-aware convolutional neural network + long short-term memory) language model (LM) framework; see the sketch after this list
We have presented a comprehensive language modeling study over a set of 50 typologically diverse languages
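The following is a minimal, hypothetical sketch of one way such output-matrix fine-tuning could be wired up; it is not the paper's exact procedure. The names (`CharCNNEncoder`, `subword_regulariser`) and the specific choice of an L2 pull of each row of M_w towards a character-level encoding of the corresponding word are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's exact method): regularise the rows of the
# output matrix M_w towards character-level encodings of the corresponding words,
# so that rare words receive subword-informed output embeddings.
import torch
import torch.nn as nn

class CharCNNEncoder(nn.Module):
    """Encodes a word (a padded sequence of character ids) into a vector."""
    def __init__(self, n_chars, char_dim=16, word_dim=128, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size, padding=1)

    def forward(self, char_ids):                     # (batch, max_word_len)
        x = self.char_emb(char_ids).transpose(1, 2)  # (batch, char_dim, len)
        x = torch.relu(self.conv(x))                 # (batch, word_dim, len)
        return x.max(dim=2).values                   # max-pool over characters

def subword_regulariser(output_matrix, char_encoder, vocab_char_ids):
    """L2 penalty pulling each row of M_w towards its char-CNN encoding."""
    char_vecs = char_encoder(vocab_char_ids)         # (vocab_size, word_dim)
    return ((output_matrix - char_vecs) ** 2).sum(dim=1).mean()

# Usage (illustrative): add `lambda_sub * subword_regulariser(...)` to the LM loss
# during training, or run a few extra fine-tuning steps on M_w after training.
```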
Summary
A traditional recurrent neural network (RNN) LM setup operates on a limited, closed vocabulary of words (Bengio et al., 2003; Mikolov et al., 2010). The limitation arises because the model learns parameters exclusive to single words. A standard training procedure for neural LMs gradually modifies these parameters based on contextual/distributional information: each occurrence of a word token in the training data contributes to the estimate of the word vector (i.e., model parameters) assigned to that word type. Low-frequency words therefore often receive unreliable estimates, having moved little from their random initialisation. A common strategy for dealing with this issue is to exclude the low-quality parameters from the model (i.e., to replace the corresponding words with a single placeholder token), so that only a subset of the vocabulary is represented by the model.
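As a concrete illustration of this standard closed-vocabulary preprocessing, the short sketch below (with illustrative names such as `build_vocab` and a `<unk>` placeholder; not code from the paper) shows how words below a frequency threshold are mapped to a single placeholder id:

```python
# Minimal sketch of closed-vocabulary preprocessing: words below a frequency
# threshold are replaced by a single <unk> placeholder, so the model never
# learns dedicated parameters for them. (Illustrative names only.)
from collections import Counter

def build_vocab(corpus_tokens, min_count=5, unk="<unk>"):
    counts = Counter(corpus_tokens)
    vocab = {unk: 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def to_ids(tokens, vocab, unk="<unk>"):
    return [vocab.get(t, vocab[unk]) for t in tokens]

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens, min_count=1)
ids = to_ids("the dog sat".split(), vocab)  # "dog" is unseen, so it maps to the <unk> id
```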