Abstract
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision were proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper we propose to incorporate data of related high-resource languages. In contrast to previous approaches which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.
Highlights
As recent research increasingly focuses on low-resource languages, learning multilingual representations for them remains challenging: the lack of data often impairs the quality of the representations
Cross-lingual word embeddings (CLWEs) are important for a wide range of NLP tasks including bilingual lexicon induction (BLI) (Vulić and Korhonen, 2016; Patra et al., 2019), Machine Translation (Lample et al., 2018), and cross-lingual transfer. Given the requirements for training data, methods which offer opportunities precisely for low-resource setups, like leveraging data from linguistically related high-resource languages, should be considered as well
By evaluating our final multilingual embedding space on the Occitan-English BLI task, we show that our method improves CLWEs for these languages compared to all the baseline settings
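BLI evaluation of the kind mentioned above is typically done by retrieving, for each source word, its cosine nearest neighbour in the target side of the shared space and scoring against a gold lexicon. The sketch below illustrates this with tiny hypothetical vectors; the words and dimensions are invented for illustration, not taken from the paper's data.

```python
import numpy as np

def bli_precision_at_1(src_vecs, tgt_vecs, gold):
    """Toy BLI evaluation: for each source word, retrieve the nearest
    target word by cosine similarity and check it against the gold lexicon.

    src_vecs, tgt_vecs: dicts mapping words to 1-D numpy arrays
    gold: dict mapping each source word to its gold translation
    """
    tgt_words = list(tgt_vecs)
    # Stack and length-normalise target vectors so a dot product = cosine.
    T = np.stack([tgt_vecs[w] for w in tgt_words])
    T /= np.linalg.norm(T, axis=1, keepdims=True)

    hits = 0
    for src, ref in gold.items():
        v = src_vecs[src]
        v = v / np.linalg.norm(v)
        pred = tgt_words[int(np.argmax(T @ v))]  # nearest target word
        hits += pred == ref
    return hits / len(gold)

# Tiny illustrative "shared space" (hypothetical 2-D vectors).
src = {"ostal": np.array([1.0, 0.1]), "aiga": np.array([0.1, 1.0])}
tgt = {"house": np.array([0.9, 0.2]), "water": np.array([0.2, 0.9])}
print(bli_precision_at_1(src, tgt, {"ostal": "house", "aiga": "water"}))  # 1.0
```

Real evaluations work the same way, just over full pre-trained embedding matrices and test dictionaries with thousands of pairs.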
Summary
Since isomorphism of vector spaces is correlated with mapping performance (Søgaard et al., 2018; Ormazabal et al., 2019), and given that high-quality alignments between English and the supporting languages are available, we map the jointly trained embeddings to the English target space. Instead of building embeddings on the concatenated corpora of the two languages (Lample et al., 2018), we use the joint-align model proposed by Wang et al. (2020). In their approach, CLWEs are learned in three steps, which we outline in the following. Since related languages share part of their vocabulary, these shared words act as a cross-lingual signal that automatically aligns the vectors of the two training languages; the joint space is then aligned with the monolingual target-language embeddings.
Corpora and vocabulary
We pursue our experiments for the low-resource Occitan language and choose French, Spanish, and Catalan as supporting languages.
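The final mapping step in pipelines like this is commonly solved with orthogonal Procrustes: given seed embeddings whose rows correspond across the two spaces, find the rotation that best aligns them. The sketch below is a generic illustration of that technique, not the exact joint-align implementation of Wang et al. (2020); the seed matrices are synthetic, with the source space constructed as a hidden rotation of the target space so the recovery can be checked.

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal Procrustes: the rotation W minimising ||XW - Y||_F
    is U @ Vt, where U, S, Vt = svd(X.T @ Y).

    X: (n, d) source-side seed embeddings; Y: (n, d) target-side ones.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 4))                  # target-side seed vectors
Q = np.linalg.qr(rng.normal(size=(4, 4)))[0]  # hidden orthogonal rotation
X = Y @ Q.T                                   # source space = rotated target

W = procrustes_map(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))       # True: rotation recovered
```

Restricting the map to a rotation preserves distances within the source space, which is exactly why near-isomorphism between the spaces matters for mapping performance.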