Abstract
After a neural sequence model encounters an unexpected token, can its behavior be predicted? We show that RNN and transformer language models exhibit structured, consistent generalization in out-of-distribution contexts. We begin by introducing two idealized models of generalization in next-word prediction: a lexical context model in which generalization is consistent with the last word observed, and a syntactic context model in which generalization is consistent with the global structure of the input. In experiments in English, Finnish, Mandarin, and random regular languages, we demonstrate that neural language models interpolate between these two forms of generalization: their predictions are well-approximated by a log-linear combination of lexical and syntactic predictive distributions. We then show that, in some languages, noise mediates the two forms of generalization: noise applied to input tokens encourages syntactic generalization, while noise in history representations encourages lexical generalization. Finally, we offer a preliminary theoretical explanation of these results by proving that the observed interpolation behavior is expected in log-linear models with a particular feature correlation structure. These results help explain the effectiveness of two popular regularization schemes and show that aspects of sequence model generalization can be understood and controlled.
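As a concrete illustration of the log-linear combination described above, the following sketch (written for this summary, not taken from the paper) mixes a toy lexical next-word distribution p_lex with a toy syntactic distribution p_syn under a mixing weight lam; all values are illustrative placeholders.

import numpy as np

# Illustrative sketch only: combine a "lexical" and a "syntactic" next-word
# distribution log-linearly, i.e. p(w) proportional to p_lex(w)**lam * p_syn(w)**(1 - lam).
def log_linear_mixture(p_lex, p_syn, lam):
    log_p = lam * np.log(p_lex) + (1.0 - lam) * np.log(p_syn)
    log_p -= log_p.max()          # subtract the max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()            # renormalize into a probability distribution

# Toy four-word vocabulary; the numbers below are made up for illustration.
p_lex = np.array([0.70, 0.10, 0.10, 0.10])   # consistent with the last word observed
p_syn = np.array([0.10, 0.60, 0.20, 0.10])   # consistent with the global structure
print(log_linear_mixture(p_lex, p_syn, lam=0.5))

At lam = 1 the mixture reduces to the lexical prediction and at lam = 0 to the syntactic one; the empirical finding summarized above is that trained LMs' out-of-distribution predictions are well approximated by combinations of this form.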
Highlights
Generalization in count-based language models (LMs). Before the widespread use of neural approaches in NLP, statistical approaches to language modeling were typically defined by explicit independence assumptions governing their generalization in contexts never observed in the training data.
This paper offers three steps toward such a characterization: 1. We present an empirical description of neural LM behavior in out-of-distribution contexts like the ones shown in (a–c). 2. We show that noise applied to input tokens or to history representations can be used to control which form of generalization a model exhibits. 3. We give a preliminary theoretical account of the observed interpolation behavior in log-linear models.
Latent-variable language models based on finite-state machines (Kuhn et al., 1994) explicitly incorporate information from the long-range context by conditioning next-word prediction on abstract global states constrained by global sentence structure.
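To make the notion of conditioning on abstract global states concrete, here is a minimal sketch (a generic HMM-style model written for this summary, not the specific models cited) in which the next-word distribution is obtained by marginalizing over latent states; the transition and emission values are toy numbers.

import numpy as np

# Toy latent-variable LM: two abstract states, four-word vocabulary.
transition = np.array([[0.8, 0.2],            # p(next state | current state)
                       [0.3, 0.7]])
emission = np.array([[0.5, 0.3, 0.1, 0.1],    # p(word | state)
                     [0.1, 0.1, 0.4, 0.4]])

def next_word_distribution(belief):
    """p(next word) = sum over states of p(next state | belief) * p(word | state)."""
    next_state = belief @ transition           # propagate the belief one step
    return next_state @ emission               # marginalize out the latent state

belief = np.array([0.9, 0.1])                  # current belief over the abstract states
print(next_word_distribution(belief))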
Summary
Generalization in count-based LMs. Before the widespread use of neural approaches in NLP, statistical approaches to language modeling were typically defined by explicit independence assumptions governing their generalization in contexts never observed in the training data. Latent-variable language models based on finite-state machines (Kuhn et al., 1994) (or more expressive automata; Chelba and Jelinek, 1998; Pauls and Klein, 2012) explicitly incorporate information from the long-range context by conditioning next-word prediction on abstract global states constrained by global sentence structure. In models of both kinds, behavior in contexts unlike any seen at training time must be explicitly specified via backoff and smoothing schemes aimed at providing robust estimates of the frequency of rare events (Good, 1953; Katz, 1987; Kneser and Ney, 1995). The precise nature and limits of that generalization, especially its robustness to unusual syntax and its ability to incorporate information about global sentence structure, remain a topic of ongoing study.
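The sketch below illustrates, in simplified form, what it means for out-of-distribution behavior to be explicitly specified in a count-based LM: a bigram estimate that backs off to an add-k smoothed unigram whenever the conditioning context was never observed. It illustrates the general backoff idea only, not the Good-Turing, Katz, or Kneser-Ney estimators cited above, and it is not a properly discounted, normalized backoff model.

from collections import Counter

def make_backoff_lm(tokens, k=1.0):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    total = sum(unigrams.values())

    def prob(word, prev):
        if bigrams[(prev, word)] > 0:
            # context and continuation were observed: use the relative frequency
            return bigrams[(prev, word)] / unigrams[prev]
        # unseen context or continuation: back off to an add-k smoothed unigram
        return (unigrams[word] + k) / (total + k * vocab_size)

    return prob

tokens = "the cat sat on the mat".split()
prob = make_backoff_lm(tokens)
print(prob("cat", "the"))   # observed bigram: relative frequency
print(prob("cat", "dog"))   # context never seen in training: smoothed unigram estimate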