Abstract

Spectral models for learning weighted non-deterministic automata have attractive theoretical and algorithmic properties. Despite this, it has been challenging to obtain competitive results in language modeling tasks, for two main reasons. First, in order to capture long-range dependencies in the data, the method must use statistics from long substrings, which results in very large matrices that are difficult to decompose. Second, the loss function behind spectral learning, based on moment matching, differs from the probabilistic metrics used to evaluate language models. In this work we employ a technique for scaling up spectral learning, and use interpolated predictions that are optimized to minimize perplexity. Our experiments in character-based language modeling show that our method matches the performance of state-of-the-art n-gram models, while being very fast to train.
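As a rough illustration of the interpolation step mentioned in the abstract, the Python sketch below linearly interpolates the next-character probabilities of two models and picks the mixing weight that minimizes perplexity on held-out data. The probability arrays, the `interpolate` helper, and the grid search are illustrative placeholders, not the paper's actual pipeline.

```python
import numpy as np

# Hypothetical next-character probabilities assigned by two models to the
# same held-out sequence: p_pnfa from the spectral PNFA, p_ngram from an
# n-gram model (values are made up for illustration).
p_pnfa = np.array([0.08, 0.30, 0.05, 0.12])
p_ngram = np.array([0.10, 0.25, 0.09, 0.15])

def perplexity(probs):
    """Perplexity of a sequence given its per-symbol probabilities."""
    return np.exp(-np.mean(np.log(probs)))

def interpolate(weight):
    """Linear interpolation with `weight` on the PNFA prediction."""
    return weight * p_pnfa + (1.0 - weight) * p_ngram

# Choose the interpolation weight that minimizes held-out perplexity.
grid = np.linspace(0.0, 1.0, 101)
best = min(grid, key=lambda w: perplexity(interpolate(w)))
print(best, perplexity(interpolate(best)))
```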

Highlights

  • In recent years we have witnessed the development of spectral methods based on matrix decompositions to learn Probabilistic Non-deterministic Finite Automata (PNFA) and related models (Hsu et al., 2009, 2012; Bailly et al., 2009; Balle et al., 2011; Cohen et al., 2012; Balle et al., 2014).

  • The spectral method is based on computing a Hankel matrix that contains statistics of expectations over substrings generated by the target language (see the sketch after this list).

  • Our experiments show that these two simple ideas bring us one step closer to making spectral methods for PNFA reach state-of-the-art performance on language modeling tasks (Section 4).
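The sketch below, referenced from the second highlight, shows one way a Hankel matrix of substring expectations could be assembled from a toy corpus. The corpus, the basis of prefixes and suffixes, and the frequency-based expectation are all illustrative assumptions; the paper selects much larger bases over long substrings.

```python
from collections import Counter
import numpy as np

# Toy corpus and a small basis of prefixes/suffixes (purely illustrative).
corpus = ["abab", "abba", "baab"]
prefixes = ["", "a", "b", "ab"]
suffixes = ["", "a", "b", "ba"]

# Count every substring occurrence in every string of the corpus.
counts = Counter()
for s in corpus:
    for i in range(len(s) + 1):
        for j in range(i, len(s) + 1):
            counts[s[i:j]] += 1

def expectation(w):
    """Empirical expected number of occurrences of w per string."""
    return counts[w] / len(corpus)

# Hankel matrix of substring expectations: H[p, s] = E[count of p + s].
H = np.array([[expectation(p + s) for s in suffixes] for p in prefixes])
print(H)
```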


Summary

Introduction

In recent years we have witnessed the development of spectral methods based on matrix decompositions to learn Probabilistic Non-deterministic Finite Automata (PNFA) and related models (Hsu et al., 2009, 2012; Bailly et al., 2009; Balle et al., 2011; Cohen et al., 2012; Balle et al., 2014). To capture long-range dependencies of the data, these methods must compute statistics over long substrings; a consequence of this is that the Hankel matrix can become too large to make it practical to perform algebraic decompositions. To address this problem we use the basis selection technique of Quattoni et al. (2017) to scale spectral learning and model long-range dependencies. There have been some proposals for generalizing the fundamental ideas of spectral learning to other loss functions (Parikh et al., 2014; Quattoni et al., 2014). While these approaches are promising, they have the downside that they lead to relatively expensive iterative convex optimizations, and it remains a challenge to scale them to model long-range dependencies. The main contribution of our work consists in combining two simple ideas: incorporating long-range dependencies via basis selection of long substring moments (Section 2), and refining the predictions of the PNFA with an iterative interpolation step (Section 3). In this paper we present experiments with one type of expectation and interpolation model that illustrates the potential of this approach.
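For concreteness, here is a minimal sketch of the standard spectral learning step in the style of Hsu et al. (2009) and Balle et al. (2014): a truncated SVD of the Hankel matrix yields the initial, transition, and final weights of a weighted automaton. Function and variable names are ours, and the sketch omits the basis selection and interpolation refinements that are the focus of this paper.

```python
import numpy as np

def spectral_wfa(H, H_sigma, h_P, h_S, rank):
    """Recover a weighted automaton from Hankel statistics via truncated SVD.

    H        : |P| x |S| Hankel matrix, H[p, s] = f(p s)
    H_sigma  : dict mapping each symbol sigma to its |P| x |S| matrix f(p sigma s)
    h_P      : length-|P| vector with h_P[p] = f(p)
    h_S      : length-|S| vector with h_S[s] = f(s)
    rank     : number of states of the recovered automaton
    """
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:rank].T                      # top right singular vectors
    HV_pinv = np.linalg.pinv(H @ V)      # pseudo-inverse of H V

    alpha0 = h_S @ V                     # initial weights
    alpha_inf = HV_pinv @ h_P            # final weights
    A = {s: HV_pinv @ Hs @ V for s, Hs in H_sigma.items()}  # transition operators
    return alpha0, A, alpha_inf

def score(word, alpha0, A, alpha_inf):
    """Value assigned to a string: alpha0^T A_{w1} ... A_{wn} alpha_inf."""
    v = alpha0
    for c in word:
        v = v @ A[c]
    return v @ alpha_inf
```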

Probabilistic Non-Deterministic Finite Automata
The Spectral Method
Interpolated Predictions
Experiments
Findings
Conclusions