Abstract

Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphology-based at the word embedding layer, through local syntax in the lower contextual layers, to longer range semantics such as coreference at the upper layers. Together, these results suggest that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

Highlights

  • We ran a series of controlled trials by swapping out pre-trained GloVe vectors (Pennington et al., 2014) for contextualized word vectors from each bidirectional language model (biLM), computed by applying the learned weighted average ELMo pooling from Peters et al. (2018); a sketch of this pooling appears after this list

  • Our experiments show that deep biLMs learn representations that vary with network depth, from morphology in the word embedding layer, to local syntax in the lowest contextual layers, to semantic relationships such as coreference in the upper layers

  • We have shown that deep biLMs learn a rich hierarchy of contextual information, both at the word and span level, and that this is captured in three disparate types of network architectures
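A minimal sketch of the learned weighted-average pooling referenced in the first highlight, assuming a PyTorch implementation: softmax-normalized scalar weights over the biLM layers plus a global scale, following the ELMo formulation of Peters et al. (2018). The class name `ScalarMix`, the toy tensor shapes, and the layer count are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned weighted average over biLM layers, in the style of the
    ELMo pooling of Peters et al. (2018). Names and shapes here are
    illustrative, not the authors' exact implementation."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per biLM layer plus a global scale gamma.
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_activations: torch.Tensor) -> torch.Tensor:
        # layer_activations: (num_layers, batch, seq_len, dim)
        # Softmax-normalize the per-layer weights, then mix the layers.
        s = torch.softmax(self.scalar_weights, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed

# Usage: given activations from a 3-layer biLM (word embedding layer plus
# two contextual layers), produce one contextual vector per token that
# replaces the GloVe embedding in the downstream task model.
activations = torch.randn(3, 8, 20, 1024)    # toy shapes
elmo_like = ScalarMix(num_layers=3)(activations)
print(elmo_like.shape)                       # torch.Size([8, 20, 1024])
```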


Summary

Introduction

Contextualized word embeddings (Peters et al., 2018) derived from pre-trained bidirectional language models (biLMs) have been shown to substantially improve performance for many NLP tasks including question answering, entailment and sentiment classification (Peters et al., 2018), constituency parsing (Kitaev and Klein, 2018; Joshi et al., 2018), named entity recognition (Peters et al., 2017), and text classification (Howard and Ruder, 2018). Despite these gains, many open questions remain about how and why these models are so effective. We take a step towards such understanding by empirically studying how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both direct end-task accuracies and the types of neural representations that are induced (e.g. how they encode notions of syntax and semantics). Previous work on learning contextual representations has used LSTM-based biLMs, but there is no prior reason to believe this is the best possible architecture. More computationally efficient networks have been introduced for sequence modeling, including gated CNNs for language modeling (Dauphin et al., 2017) and feed forward self-attention based approaches for machine translation (Transformer; Vaswani et al., 2017). As RNNs are forced to compress the entire history into a hidden state vector before making predictions, while CNNs with a large receptive field and the Transformer may directly reference previous tokens, each architecture will represent information in a different manner.
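
To make this architectural contrast concrete, the sketch below shows the three kinds of context encoders side by side in PyTorch: a recurrent LSTM that compresses the history into its hidden state, a gated causal CNN in the style of Dauphin et al. (2017), and a Transformer layer whose masked self-attention can reference any earlier token directly. The dimensions, layer counts, and kernel size are illustrative assumptions, not the paper's configurations.

```python
import torch
import torch.nn as nn

dim, seq_len, batch = 512, 20, 4
tokens = torch.randn(batch, seq_len, dim)   # toy word embeddings

# 1) LSTM: the history is compressed into a recurrent hidden state
#    before each prediction.
lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
lstm_out, _ = lstm(tokens)

# 2) Gated CNN (Dauphin et al., 2017 style): a fixed receptive field over
#    previous tokens, combined through a gated linear unit. Padding on the
#    left plus trimming keeps the convolution causal.
conv = nn.Conv1d(dim, 2 * dim, kernel_size=3, padding=2)
h = conv(tokens.transpose(1, 2))[:, :, :seq_len]
cnn_out = nn.functional.glu(h, dim=1).transpose(1, 2)

# 3) Transformer: self-attention lets every position attend to all earlier
#    tokens directly; the boolean mask enforces the left-to-right direction.
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
transformer_out = encoder_layer(tokens, src_mask=causal_mask)

print(lstm_out.shape, cnn_out.shape, transformer_out.shape)
# each: torch.Size([4, 20, 512])
```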

