Abstract

Recent efforts in cross-lingual word embedding (CLWE) learning have predominantly focused on fully unsupervised approaches that project monolingual embeddings into a shared cross-lingual space without any cross-lingual signal. The lack of any supervision makes such approaches conceptually attractive. Yet, their only core difference from (weakly) supervised projection-based CLWE methods is in the way they obtain a seed dictionary used to initialize an iterative self-learning procedure. The fully unsupervised methods have arguably become more robust, and their primary use case is CLWE induction for pairs of resource-poor and distant languages. In this paper, we question the ability of even the most robust unsupervised CLWE approaches to induce meaningful CLWEs in these more challenging settings. A series of bilingual lexicon induction (BLI) experiments with 15 diverse languages (210 language pairs) shows that fully unsupervised CLWE methods still fail for a large number of language pairs (e.g., they yield zero BLI performance for 87/210 pairs). Even when they succeed, they never surpass the performance of weakly supervised methods (seeded with 500-1,000 translation pairs) using the same self-learning procedure in any BLI setup, and the gaps are often substantial. These findings call for revisiting the main motivations behind fully unsupervised CLWE methods.
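The abstract's central technical point, that unsupervised and weakly supervised methods differ only in how the seed dictionary is obtained, is easiest to see in code. Below is a minimal sketch of the shared self-learning loop, not any system's exact implementation (robust toolkits add refinements such as CSLS-style retrieval and frequency-based vocabulary cutoffs). All names are illustrative; X and Y are assumed to be row-normalized monolingual embedding matrices.

```python
# Illustrative sketch of the self-learning loop shared by unsupervised and
# weakly supervised projection-based CLWE methods; the two families differ
# only in where the initial `seed_pairs` dictionary comes from.
import numpy as np

def solve_procrustes(X, Y, pairs):
    """Orthogonal map W minimizing ||X[src] @ W - Y[tgt]||_F (Procrustes)."""
    src, tgt = map(list, zip(*pairs))
    U, _, Vt = np.linalg.svd(X[src].T @ Y[tgt])
    return U @ Vt

def self_learning(X, Y, seed_pairs, iterations=5):
    """X, Y: row-normalized source/target embedding matrices (n x d)."""
    pairs = list(seed_pairs)
    for _ in range(iterations):
        W = solve_procrustes(X, Y, pairs)
        # Re-induce a larger (noisier) dictionary: pair every projected
        # source word with its nearest target neighbor; on unit-length
        # rows the dot product equals cosine similarity.
        sims = (X @ W) @ Y.T
        pairs = list(enumerate(sims.argmax(axis=1)))
    return W
```

With a weak seed (e.g., the 500-1,000 translation pairs mentioned in the abstract), the loop starts from a given dictionary; fully unsupervised methods must instead bootstrap that first dictionary from the monolingual spaces alone, and this initialization step is exactly what the paper finds to fail for many distant language pairs.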

Highlights

  • The wide use and success of monolingual word embeddings in NLP tasks (Turian et al., 2010; Chen and Manning, 2014) has inspired further research focus on the induction of cross-lingual word embeddings (CLWEs)

  • We show that the most robust unsupervised CLWE approach still fails completely when it relies on monolingual word vectors trained on domain-dissimilar corpora

  • While the “no supervision at all” premise behind fully unsupervised CLWE methods is seductive, our study strongly suggests that future research should revisit the main motivation behind these methods and focus on designing even more robust solutions, given the current methods' inability to support a wide spectrum of language pairs


Summary

Introduction and Motivation

The wide use and success of monolingual word embeddings in NLP tasks (Turian et al., 2010; Chen and Manning, 2014) has inspired further research focus on the induction of cross-lingual word embeddings (CLWEs). The landscape of CLWE methods has recently been dominated by the so-called projection-based methods (Mikolov et al., 2013a; Ruder et al., 2019; Glavaš et al., 2019). They align two monolingual embedding spaces by learning a projection/mapping based on a training dictionary of translation pairs. Besides their simple conceptual design and competitive performance, their popularity originates from the fact that they rely on rather weak cross-lingual supervision. The seed dictionaries typically spanned several thousand word pairs (Mikolov et al., 2013a; Faruqui and Dyer, 2014; Xing et al., 2015), but more recent work has shown that CLWEs can be induced with even weaker supervision from small dictionaries spanning several hundred pairs (Vulić and Korhonen, 2016), identical strings (Smith et al., 2017), or even only shared numerals (Artetxe et al., 2017).
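To make these weakest forms of supervision concrete, the sketch below harvests a seed dictionary from identical strings shared by the two vocabularies (in the spirit of Smith et al., 2017); a one-line restriction approximates the numeral-only seed of Artetxe et al. (2017). The vocab_src/vocab_tgt dicts and the function name are illustrative assumptions, not the interface of any particular toolkit.

```python
def identical_string_seed(vocab_src, vocab_tgt, numerals_only=False):
    """Harvest a weak seed dictionary from identically spelled words.

    vocab_src / vocab_tgt: illustrative dicts mapping word -> row index
    in the corresponding embedding matrix. Returns (src_idx, tgt_idx)
    pairs usable as `seed_pairs` in the self-learning sketch above.
    """
    shared = set(vocab_src) & set(vocab_tgt)
    if numerals_only:
        # Even weaker supervision: keep only shared numeral strings.
        shared = {w for w in shared if w.isdigit()}
    return [(vocab_src[w], vocab_tgt[w]) for w in sorted(shared)]
```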
