Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology

Rochelle Choenni, Ekaterina Shutova

doi:10.1162/coli_a_00444

Open Access

Journal: Computational Linguistics	Publication Date: Sep 1, 2022
Citations: 3	License type: CC BY-NC-ND 4.0

Affiliation: University of Amsterdam

Abstract

Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer is, however, dependent on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models, M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.
