Abstract

This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model simply tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand labeled for this latent structure. However, we demonstrate that modern deep contextual language models learn major aspects of this structure, without any explicit supervision. We develop methods for identifying linguistic hierarchical structure emergent in artificial neural networks and demonstrate that components in these models focus on syntactic grammatical relationships and anaphoric coreference. Indeed, we show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists. These results help explain why these models have brought such large improvements across many language-understanding tasks.
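The claim about a linear transformation of the embeddings can be made concrete as a "structural probe": learn a single matrix B such that squared distances between transformed word vectors approximate path lengths in the parse tree. The sketch below is a hedged illustration of that idea, not the paper's released implementation; the PyTorch framing and names such as `hidden_dim` and `probe_rank` are assumptions for the example.

```python
import torch

# Illustrative structural probe: learn a linear map B so that squared L2
# distances between projected word vectors approximate parse-tree distances.
hidden_dim, probe_rank = 768, 128          # BERT-base hidden size; rank is a free choice
B = torch.nn.Parameter(torch.randn(probe_rank, hidden_dim) * 0.01)
optimizer = torch.optim.Adam([B], lr=1e-3)

def predicted_distances(H):
    """H: (n_words, hidden_dim) contextual embeddings for one sentence.
    Returns an (n_words, n_words) matrix of squared distances under B."""
    T = H @ B.T                             # project into probe space
    diff = T.unsqueeze(0) - T.unsqueeze(1)  # pairwise differences
    return (diff ** 2).sum(-1)

def probe_loss(H, tree_dist):
    """tree_dist: gold parse-tree path lengths between word pairs."""
    return (predicted_distances(H) - tree_dist).abs().mean()

# Tiny self-contained demonstration on random data; real use would feed
# BERT embeddings and gold treebank distances.
H = torch.randn(5, hidden_dim)              # five "words"
D = torch.randint(1, 5, (5, 5)).float()     # fake tree distances
D.fill_diagonal_(0)
for _ in range(100):
    optimizer.zero_grad()
    loss = probe_loss(H, D)
    loss.backward()
    optimizer.step()
print(float(loss))
```

If such a low-rank map can be trained to low loss on held-out sentences, tree distances are approximately linearly recoverable from the embeddings, which is the sense in which the abstract says parse tree distances are captured.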

Highlights

  • This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model tries to predict a masked word in a given context

  • Since many of Bidirectional Encoder Representations from Transformers (BERT)’s attention heads encode individual syntactic relations, it is natural to wonder whether the representations, that is, the vectors that represent the words in each layer of BERT, embed syntax trees (a sketch of how a single head can be inspected follows this list)

  • We find that dependency tree structures are embedded in BERT representations to a striking extent.
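
As a concrete illustration of the second highlight, one can inspect a single attention head and ask which token each word attends to most, then compare that choice against a gold dependency head. The snippet below is a rough sketch using the Hugging Face transformers library; the particular layer and head indices are arbitrary assumptions, and wordpiece-to-word alignment is ignored for brevity.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load BERT with attention weights exposed.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The keys to the cabinet are on the table"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

layer, head = 7, 9                           # illustrative indices, not the paper's
attn = outputs.attentions[layer][0, head]    # (seq_len, seq_len) attention weights
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# For each token, report the position it attends to most strongly; checking
# these arg-max targets against gold dependency heads is the kind of
# head-by-head evaluation the highlight describes.
for i, tok in enumerate(tokens):
    j = int(attn[i].argmax())
    print(f"{tok:>10s} attends most to {tokens[j]}")
```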

Introduction

This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand-labeled for this latent structure. Researchers train progressively better supervised machine-learning models on the treebank, which attempt to recover this structure for any sentence [6,7,8]. This approach has been very effective as an engineering solution, but beyond the high practical cost of human labeling, it gives no insight into how children might approach structure induction from observed data alone. Within linguistics, the dominant perspective has been that language models are inadequate and of no scientific interest, even when their usefulness in practical engineering applications is grudgingly accepted [15, 16].

