Abstract

In this tutorial, we provide a comprehensive survey of the exciting recent work on cutting-edge weakly-supervised and unsupervised cross-lingual word representations. After providing a brief history of supervised cross-lingual word representations, we focus on: 1) how to induce weakly-supervised and unsupervised cross-lingual word representations in truly resource-poor settings where bilingual supervision cannot be guaranteed; 2) critical examinations of different training conditions and requirements under which unsupervised algorithms can and cannot work effectively; 3) more robust methods for distant language pairs that can mitigate instability issues and low performance; 4) how to comprehensively evaluate such representations; and 5) diverse applications that benefit from cross-lingual word representations (e.g., MT, dialogue, cross-lingual sequence labeling and structured prediction applications, cross-lingual IR).

Highlights

  • We provide a comprehensive survey of the exciting recent work on cutting-edge weakly-supervised and unsupervised cross-lingual word representations

  • Part I: Introduction We first present an overview of cross-lingual NLP research, situating the current work on unsupervised cross-lingual representation learning, and motivating the need for multilingual training and cross-lingual transfer for resource-poor languages with weak supervision or no bilingual supervision at all

  • We present key downstream applications for cross-lingual word representations, such as bilingual lexicon induction and unsupervised MT (Lample et al., 2018b)


Summary

Motivation and Objectives

Cross-lingual word representations offer an elegant and language-pair-independent way to represent content across different languages. Recent work has already verified the usefulness of cross-lingual word representations in a wide variety of downstream tasks, and several survey papers provide extensive model classifications (Upadhyay et al., 2016; Ruder et al., 2018b). They cluster supervised cross-lingual word representation models according to the bilingual supervision required to induce such shared cross-lingual semantic spaces, covering models based on word alignments and readily available bilingual dictionaries (Mikolov et al., 2013; Smith et al., 2017), sentence-aligned parallel data (Gouws et al., 2015), document-aligned data (Søgaard et al., 2015; Vulić and Moens, 2016), or even image tags and captions (Rotman et al., 2018). Owing to the strong assumption on the similarity of space topology, these models often converge to non-optimal solutions, and their robustness is one of the crucial research questions at present (Søgaard et al., 2018).

In this tutorial, we provide a comprehensive survey of the exciting recent work on cutting-edge weakly-supervised and unsupervised cross-lingual word representations. We will deliver a detailed survey of the current cutting-edge methods, discuss best training and evaluation practices and use-cases, and provide links to publicly available implementations, datasets, and pretrained models and word embedding collections.
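To make the notion of bilingual dictionary supervision concrete, the sketch below illustrates the standard projection-based setup in the spirit of Mikolov et al. (2013) and Smith et al. (2017): given monolingual embeddings for the word pairs of a small seed dictionary, an orthogonal mapping between the two spaces is learned in closed form (orthogonal Procrustes), and translations are then retrieved by nearest-neighbour search in the shared space. This is an illustrative sketch rather than code from the tutorial; the array names and helper functions are assumptions.

import numpy as np

def learn_orthogonal_mapping(X, Y):
    # X, Y: (n_pairs, dim) matrices with source/target embeddings of the word
    # pairs in a seed bilingual dictionary (names are illustrative).
    # Orthogonal Procrustes: W = argmin_W ||X W - Y||_F subject to W^T W = I,
    # solved in closed form from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # (dim, dim) mapping from the source space to the target space

def nearest_target_words(src_vec, W, tgt_matrix, tgt_words, k=5):
    # Map one source word vector into the target space and return the k
    # closest target words by cosine similarity (basic bilingual lexicon
    # induction retrieval).
    mapped = src_vec @ W
    mapped = mapped / np.linalg.norm(mapped)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = tgt @ mapped
    best = np.argsort(-sims)[:k]
    return [(tgt_words[i], float(sims[i])) for i in best]

The weakly-supervised and unsupervised methods covered in the tutorial keep essentially the same mapping and retrieval steps, but induce the seed dictionary itself with little or no bilingual supervision, which is where the robustness issues discussed above arise.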

