Abstract

Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 million words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, in which a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn the monolingual embeddings: word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that, by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can be increased by approximately 20 percentage points.
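
To make the supervised alignment step concrete, the sketch below shows one standard instantiation of such a linear mapping: the orthogonal Procrustes solution, which is widely used for dictionary-supervised alignment (the paper's exact formulation may differ). Here `X` and `Y` are illustrative matrices of monolingual source (Welsh) and target (English) vectors for the seed dictionary pairs; all names are assumptions, not taken from the paper.

```python
# Minimal sketch of dictionary-supervised linear alignment via orthogonal
# Procrustes. X and Y are (n, d) matrices whose rows are the monolingual
# vectors of the n translation pairs in the seed bilingual dictionary.
import numpy as np

def learn_orthogonal_mapping(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Solve min_W ||XW - Y||_F subject to W being orthogonal."""
    # The SVD of the cross-covariance matrix gives the closed-form solution.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # (d, d) rotation mapping source vectors into target space

# Toy demo with random stand-in vectors (real use: rows are word2vec or
# fastText vectors looked up for each dictionary pair).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
W_true = np.linalg.qr(rng.normal(size=(300, 300)))[0]  # planted rotation
Y = X @ W_true
W = learn_orthogonal_mapping(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))  # True: the rotation is recovered
```

The orthogonality constraint preserves distances within the monolingual space and, in the alignment literature, typically improves dictionary induction over an unconstrained least-squares map.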

Highlights

  • A popular direction in current natural language processing (NLP) research consists of learning vector space representations of words for two or more languages, and applying some kind of transformation to one of the spaces such that “cross-lingual synonyms”, i.e., words with the same meaning across languages, are assigned similar vector space representations

  • We report these results in terms of accuracy (ACC), i.e., we record a true positive only if the nearest neighbor in the mapped space is a translation of the source word (see the sketch after this list)

  • The task of bilingual dictionary induction, a natural byproduct of learning bilingual mappings, which we introduce in Section 4, is a good proxy for evaluating the quality of cross-lingual mappings
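
The following sketch illustrates the CSLS retrieval criterion named in the abstract together with the accuracy (precision@1) metric described in the highlights. It assumes `mapped_src` and `tgt` are row-normalised embedding matrices and that `gold[i]` gives the set of acceptable target indices for source row `i`; all names are illustrative, and in practice the hubness terms are computed over the full vocabularies rather than only the evaluation pairs.

```python
# Sketch of cross-domain similarity local scaling (CSLS) and precision@1:
# a prediction counts as correct only if the top-ranked target word is a
# translation of the source word.
import numpy as np

def csls_scores(mapped_src: np.ndarray, tgt: np.ndarray, k: int = 10) -> np.ndarray:
    sims = mapped_src @ tgt.T  # cosine similarities (rows are unit vectors)
    # Mean similarity of each mapped source word to its k nearest targets...
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # ...and of each target word to its k nearest mapped sources.
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt  # penalises "hub" words in both directions

def accuracy_at_1(mapped_src, tgt, gold, k: int = 10) -> float:
    preds = csls_scores(mapped_src, tgt, k).argmax(axis=1)
    return float(np.mean([p in gold[i] for i, p in enumerate(preds)]))
```

Compared with plain cosine nearest-neighbour retrieval, the two correction terms discount words that are close to many vectors at once, which is the hubness problem that CSLS was introduced to mitigate.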


Introduction

A popular direction in current natural language processing (NLP) research consists of learning vector space representations of words for two or more languages, and applying some kind of transformation to one of the spaces such that “cross-lingual synonyms”, i.e., words with the same meaning across languages, are assigned similar vector space representations. The value of these cross-lingual embeddings in downstream tasks is indisputable today, with applications ranging from information retrieval [1], entity linking [2] and text classification [3,4] to natural language inference and lexical semantics [5].
