Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings

Lisa Woller, Viktor Hangya, Alexander Fraser

doi:10.18653/v1/2021.mrl-1.4

Abstract

Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision were proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper we propose to incorporate data of related high-resource languages. In contrast to previous approaches which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.

Highlights

Recent research is increasingly interested in low-resource languages, yet the lack of data often impairs the quality of the representations learned for them
Cross-lingual word embeddings (CLWEs) are important for a wide range of NLP tasks including bilingual lexicon induction (BLI) (Vulić and Korhonen, 2016; Patra et al., 2019), Machine Translation (Lample et al., 2018), and cross-lingual trans…
… requirements for training data, methods which offer opportunities precisely for low-resource setups, like leveraging data from linguistically related high-resource languages, should be considered as well
By evaluating our final multilingual embedding space on the Occitan-English BLI task, we show that our method improves CLWEs for these languages compared to all the baseline settings
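
The BLI evaluation mentioned above can be illustrated with a minimal sketch. It assumes that word translation is done by cosine nearest-neighbour retrieval in the shared space and scored with precision@1; the source does not state the retrieval method or metric details, and the helper names (`cosine`, `translate`, `precision_at_1`), the toy vectors, and the toy gold lexicon are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translate(word, src_emb, tgt_emb):
    """Return the target word whose vector is closest to the source word's vector."""
    query = src_emb[word]
    return max(tgt_emb, key=lambda t: cosine(query, tgt_emb[t]))


def precision_at_1(gold, src_emb, tgt_emb):
    """Fraction of source words whose top-1 retrieved word is a gold translation."""
    pairs = [(s, ts) for s, ts in gold.items() if s in src_emb]
    if not pairs:
        return 0.0
    hits = sum(translate(s, src_emb, tgt_emb) in ts for s, ts in pairs)
    return hits / len(pairs)


# Toy, hand-made vectors (assumption: both sides already live in one shared space).
oc_emb = {
    "ostal": [0.9, 0.1, 0.0],
    "aiga": [0.1, 0.9, 0.1],
    "can": [0.0, 0.2, 0.9],
    "gat": [0.2, 0.1, 0.9],
}
en_emb = {
    "house": [1.0, 0.0, 0.1],
    "water": [0.0, 1.0, 0.0],
    "dog": [0.1, 0.1, 1.0],
    "cat": [0.3, 0.0, 0.8],
}
# Toy gold lexicon for illustration only, not taken from the paper's test set.
gold = {"ostal": {"house"}, "aiga": {"water"}, "can": {"dog"}, "gat": {"cat"}}

score = precision_at_1(gold, oc_emb, en_emb)
print(f"P@1 = {score:.2f}")
```

On these toy vectors three of the four queries retrieve a gold translation, so the score is 0.75. Real evaluations typically restrict the target vocabulary and may use a different retrieval criterion.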

Summary

Since isomorphism of vector spaces is correlated with mapping performance (Søgaard et al., 2018; Ormazabal et al., 2019) and given that high-quality alignments among English and the supporting lan…

Instead of building embeddings on the concatenated corpora of the two languages (Lample et al., 2018), we use the joint-align model proposed by Wang et al. (2020). In their approach, CLWEs are learned in three steps, which we outline in the following. Since related languages share part of their vocabulary, these words act as a cross-lingual signal to automatically align the vectors of the two … training with the monolingual target language embeddings.

Corpora and vocabulary: We conduct our experiments on the low-resource Occitan language and choose French, Spanish, and Catalan as supporting languages.
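
To make the idea of shared vocabulary as a cross-lingual signal concrete, here is a small sketch that counts the word types of two corpora and the types they have in common. It is not the joint-align model of Wang et al. (2020); the function names (`vocabulary`, `shared_types`), the whitespace tokenisation, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter


def vocabulary(sentences, min_count=1):
    """Collect lower-cased, whitespace-tokenised word types seen at least min_count times."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return {w for w, c in counts.items() if c >= min_count}


def shared_types(vocab_a, vocab_b):
    """Word types that occur with identical spelling in both vocabularies."""
    return vocab_a & vocab_b


# Toy sentences standing in for the low-resource and the supporting corpus
# (hypothetical data, not drawn from the paper's corpora).
low_resource = ["la lenga occitana", "una lenga"]
supporting = ["la llengua catalana", "una llengua"]

vocab_low = vocabulary(low_resource)
vocab_sup = vocabulary(supporting)
shared = shared_types(vocab_low, vocab_sup)

# Identically spelled words can serve as weak anchors between the two languages.
anchors = sorted((w, w) for w in shared)
share_ratio = len(shared) / len(vocab_low)
print(len(vocab_low), len(shared), share_ratio, anchors)
```

The ratio of shared to overall types gives a rough sense of how much cross-lingual signal a supporting language offers; more closely related languages would be expected to share more types.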

Publication Date: Jan 1, 2021
Citations: 1
License type: cc-by
