Abstract
After a neural sequence model encounters an unexpected token, can its behavior be predicted? We show that RNN and transformer language models exhibit structured, consistent generalization in out-of-distribution contexts. We begin by introducing two idealized models of generalization in next-word prediction: a lexical context model in which generalization is consistent with the last word observed, and a syntactic context model in which generalization is consistent with the global structure of the input. In experiments in English, Finnish, Mandarin, and random regular languages, we demonstrate that neural language models interpolate between these two forms of generalization: their predictions are well-approximated by a log-linear combination of lexical and syntactic predictive distributions. We then show that, in some languages, noise mediates the two forms of generalization: noise applied to input tokens encourages syntactic generalization, while noise in history representations encourages lexical generalization. Finally, we offer a preliminary theoretical explanation of these results by proving that the observed interpolation behavior is expected in log-linear models with a particular feature correlation structure. These results help explain the effectiveness of two popular regularization schemes and show that aspects of sequence model generalization can be understood and controlled.
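As a concrete illustration of the log-linear combination described above, the following sketch (written for this summary, not taken from the paper) mixes a toy lexical next-word distribution p_lex with a toy syntactic distribution p_syn under a mixing weight lam; all values are illustrative placeholders.

import numpy as np

# Illustrative sketch only: combine a "lexical" and a "syntactic" next-word
# distribution log-linearly, i.e. p(w) proportional to p_lex(w)**lam * p_syn(w)**(1 - lam).
def log_linear_mixture(p_lex, p_syn, lam):
    log_p = lam * np.log(p_lex) + (1.0 - lam) * np.log(p_syn)
    log_p -= log_p.max()          # subtract the max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()            # renormalize into a probability distribution

# Toy four-word vocabulary; the numbers below are made up for illustration.
p_lex = np.array([0.70, 0.10, 0.10, 0.10])   # consistent with the last word observed
p_syn = np.array([0.10, 0.60, 0.20, 0.10])   # consistent with the global structure
print(log_linear_mixture(p_lex, p_syn, lam=0.5))

At lam = 1 the mixture reduces to the lexical prediction and at lam = 0 to the syntactic one; the empirical finding summarized above is that trained LMs' out-of-distribution predictions are well approximated by combinations of this form.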
Highlights
Generalization in count-based language models (LMs). Before the widespread use of neural approaches in NLP, statistical approaches to language modeling were typically defined by explicit independence assumptions governing their generalization in contexts never observed in the training data.
This paper offers three steps toward such a characterization: 1. We present an empirical description of neural LM behavior in out-of-distribution contexts like the ones shown in (a–c). 2. We show that noise applied to input tokens or to history representations can be used to control which form of generalization a model exhibits. 3. We give a preliminary theoretical account of the observed interpolation behavior in log-linear models.
Latent-variable language models based on finite-state machines (Kuhn et al., 1994) explicitly incorporate information from the long-range context by conditioning next-word prediction on abstract global states constrained by global sentence structure.
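To make the notion of conditioning on abstract global states concrete, here is a minimal sketch (a generic HMM-style model written for this summary, not the specific models cited) in which the next-word distribution is obtained by marginalizing over latent states; the transition and emission values are toy numbers.

import numpy as np

# Toy latent-variable LM: two abstract states, four-word vocabulary.
transition = np.array([[0.8, 0.2],            # p(next state | current state)
                       [0.3, 0.7]])
emission = np.array([[0.5, 0.3, 0.1, 0.1],    # p(word | state)
                     [0.1, 0.1, 0.4, 0.4]])

def next_word_distribution(belief):
    """p(next word) = sum over states of p(next state | belief) * p(word | state)."""
    next_state = belief @ transition           # propagate the belief one step
    return next_state @ emission               # marginalize out the latent state

belief = np.array([0.9, 0.1])                  # current belief over the abstract states
print(next_word_distribution(belief))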
Summary
Generalization in count-based LMs. Before the widespread use of neural approaches in NLP, statistical approaches to language modeling were typically defined by explicit independence assumptions governing their generalization in contexts never observed in the training data. Latent-variable language models based on finite-state machines (Kuhn et al., 1994) (or more expressive automata; Chelba and Jelinek, 1998; Pauls and Klein, 2012) explicitly incorporate information from the long-range context by conditioning next-word prediction on abstract global states constrained by global sentence structure. In models of both kinds, behavior in contexts unlike any seen at training time must be explicitly specified via backoff and smoothing schemes aimed at providing robust estimates of the frequency of rare events (Good, 1953; Katz, 1987; Kneser and Ney, 1995). The precise nature and limits of that generalization, especially its robustness to unusual syntax and its ability to incorporate information about global sentence structure, remain a topic of ongoing study.
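The sketch below illustrates, in simplified form, what it means for out-of-distribution behavior to be explicitly specified in a count-based LM: a bigram estimate that backs off to an add-k smoothed unigram whenever the conditioning context was never observed. It illustrates the general backoff idea only, not the Good-Turing, Katz, or Kneser-Ney estimators cited above, and it is not a properly discounted, normalized backoff model.

from collections import Counter

def make_backoff_lm(tokens, k=1.0):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    total = sum(unigrams.values())

    def prob(word, prev):
        if bigrams[(prev, word)] > 0:
            # context and continuation were observed: use the relative frequency
            return bigrams[(prev, word)] / unigrams[prev]
        # unseen context or continuation: back off to an add-k smoothed unigram
        return (unigrams[word] + k) / (total + k * vocab_size)

    return prob

tokens = "the cat sat on the mat".split()
prob = make_backoff_lm(tokens)
print(prob("cat", "the"))   # observed bigram: relative frequency
print(prob("cat", "dog"))   # context never seen in training: smoothed unigram estimate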