Abstract

The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to what extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the following questions: 1) How do different lexical knowledge extraction strategies (monolingual versus multilingual source LM, out-of-context versus in-context encoding, inclusion of special tokens, and layer-wise averaging) impact performance? How consistent are the observed effects across tasks and languages? 2) Is lexical knowledge stored in few parameters, or is it scattered throughout the network? 3) How do these representations fare against traditional static word vectors in lexical tasks? 4) Does the lexical information emerging from independently trained monolingual LMs display latent similarities? Our main results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks. Moreover, we validate the claim that lower Transformer layers carry more type-level lexical knowledge, but also show that this knowledge is distributed across multiple layers.

Highlights

  • Language models (LMs) based on deep Transformer networks (Vaswani et al., 2017), pretrained on unprecedentedly large amounts of text, offer unmatched performance in virtually every NLP task (Qiu et al., 2020).

  • Our study aims at providing answers to the following key questions: Q1) Do lexical extraction strategies generalise across different languages and tasks, or do they rather require language- and task-specific adjustments?; Q2) Is lexical information concentrated in a small number of parameters and layers, or scattered throughout the encoder?; Q3) Are “BERT-based” static word embeddings competitive with traditional word embeddings such as fastText?; Q4) Do monolingual LMs independently trained in multiple languages learn structurally similar representations for words denoting similar concepts?

  • A summary of the results is shown in Figure 2 for lexical semantic similarity (LSIM), in Figure 3a for bilingual lexicon induction (BLI), in Figure 3b for cross-lingual information retrieval (CLIR), in Figures 4a and 4b for relation prediction (RELP), and in Figure 4c for word analogy resolution (WA).


Summary

Introduction

Language models (LMs) based on deep Transformer networks (Vaswani et al., 2017), pretrained on unprecedentedly large amounts of text, offer unmatched performance in virtually every NLP task (Qiu et al., 2020). Models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019c), and T5 (Raffel et al., 2019) replaced task-specific neural architectures that relied on static word embeddings (WEs; Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017), where each word is assigned a single (type-level) vector. While preliminary findings from Ethayarajh (2019) and Vulić et al. (2020) suggest that there is a wealth of lexical knowledge available within the parameters of BERT and other LMs, a systematic empirical study across different languages is currently lacking.
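To make the notion of a type-level vector extracted from an LM more concrete, the following is a minimal sketch of two of the strategies named in the abstract: layer-wise averaging and optional inclusion of special tokens. It is not the authors' implementation. The function name `type_level_vector`, its parameters, and the array layout (layer 0 is the embedding layer, one row per subword token) are assumptions made for illustration, and random numbers stand in for real encoder outputs.

```python
import numpy as np


def type_level_vector(hidden_states, special_mask, max_layer, include_special=False):
    """Collapse per-layer, per-token states into one static word vector.

    hidden_states: array of shape (num_layers + 1, seq_len, dim); index 0 is
        assumed to be the embedding layer (hypothetical layout).
    special_mask: boolean sequence of length seq_len, True for special tokens
        such as [CLS] and [SEP].
    max_layer: average over layers 0..max_layer (inclusive).
    include_special: whether special tokens contribute to the average.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    special_mask = np.asarray(special_mask, dtype=bool)

    # Select which token positions take part in the average.
    keep = np.ones_like(special_mask) if include_special else ~special_mask

    # Layer-wise averaging: mean over layers 0..max_layer -> (seq_len, dim).
    layer_avg = hidden_states[: max_layer + 1].mean(axis=0)

    # Token averaging: mean over the kept subword positions -> (dim,).
    return layer_avg[keep].mean(axis=0)


# Stand-in for a 12-layer encoder's output on "[CLS] sub ##word [SEP]".
rng = np.random.default_rng(0)
states = rng.normal(size=(13, 4, 8))
mask = [True, False, False, True]

vec = type_level_vector(states, mask, max_layer=4)
print(vec.shape)
```

Under the same assumptions, an in-context variant would apply this function to the target word's subword positions in each of several sentences and then average the resulting vectors, whereas the out-of-context variant encodes the word in isolation, as in the demo above.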

