Abstract

Multilingual BERT (mBERT), trained on 104 languages, has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages mBERT supports. We explore how mBERT performs on a much wider set of languages, focusing on the quality of its representations for low-resource languages as measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-Speech Tagging, and Dependency Parsing (54 languages each). mBERT is comparable to or better than baselines on high-resource languages but does much worse on low-resource languages. Furthermore, monolingual BERT models for these languages do even worse. When a low-resource language is paired with a similar language, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low-resource languages require more efficient pretraining techniques or more data.
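
To make the within-language evaluation concrete, the sketch below fine-tunes the public mBERT checkpoint on token-level task data for a single language and takes one training step, using the Hugging Face transformers library. It is a minimal sketch, not the authors' setup: the example words, tag ids, label count, and hyperparameters are placeholders, and real experiments would use a full labeled corpus for the target language.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-multilingual-cased"  # public 104-language mBERT checkpoint
NUM_LABELS = 17                              # placeholder tag set size; task-dependent

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy stand-in for one language's labeled task data (e.g., POS-tagged sentences).
words = ["Una", "oración", "de", "ejemplo", "."]
tags = [3, 0, 1, 0, 2]  # placeholder tag ids aligned to the words above

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Map word-level tags onto subword tokens; special tokens get -100 so the loss ignores them.
labels = [-100 if wid is None else tags[wid] for wid in enc.word_ids(batch_index=0)]
enc["labels"] = torch.tensor([labels])

model.train()
loss = model(**enc).loss  # standard token-classification fine-tuning objective
loss.backward()
optimizer.step()
```

Within-language performance is then the score this fine-tuned model reaches on the same language's held-out test set.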

Highlights

  • Pretrained contextual representation models trained with language modeling (Peters et al., 2018; Yang et al., 2019) or cloze task objectives (Devlin et al., 2019; Liu et al., 2019) have quickly set a new standard for NLP tasks

  • We focus on multilingual BERT (mBERT) from the perspective of representation learning for each language, in terms of the monolingual corpus resources available, and analyze how to improve BERT for low-resource languages

  • While we might expect pretrained mBERT representations to be most beneficial for languages with only 100 labeled examples, since Howard and Ruder (2018) show that pretraining improves data efficiency for English text classification, our results show that for low-resource languages this strategy performs much worse than a model trained directly on the available task data

Summary

Introduction

Pretrained contextual representation models trained with language modeling (Peters et al., 2018; Yang et al., 2019) or cloze task objectives (Devlin et al., 2019; Liu et al., 2019) have quickly set a new standard for NLP tasks. This raises the question: does multilingual joint training help mBERT learn better representations for low-resource languages? By training various monolingual BERT models for low-resource languages with the same amount of data, we show that the low representation quality for low-resource languages is not the result of BERT's hyperparameters or of sharing the model across a large number of languages, as monolingual BERT performs worse than mBERT. By pairing low-resource languages with linguistically related languages, we show that low-resource languages benefit from multilingual joint training: bilingual BERT outperforms monolingual BERT while still lagging behind mBERT. With a small monolingual corpus, BERT does not learn high-quality representations for low-resource languages. We leave exploring more data-efficient pretraining techniques as future work.
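
The monolingual and bilingual comparisons rest on masked-language-model (cloze) pretraining over a small corpus for the target language(s). The sketch below shows one such pretraining step for a from-scratch BERT with Hugging Face transformers; it is a minimal illustration under assumed settings (reusing the mBERT vocabulary, a placeholder corpus, and the default BERT-base configuration), not the authors' training recipe.

```python
import torch
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# A real setup would train a WordPiece vocabulary on the target-language corpus;
# reusing the mBERT vocabulary here just keeps the sketch short.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

config = BertConfig(vocab_size=tokenizer.vocab_size)  # default BERT-base sizes
model = BertForMaskedLM(config)  # randomly initialized, i.e., trained from scratch

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Placeholder corpus; a low-resource monolingual corpus (optionally mixed with a
# linguistically related language for the bilingual setting) would go here instead.
corpus = [
    "A handful of sentences standing in for a small monolingual corpus.",
    "Adding text from a related language gives the bilingual variant.",
]
features = [tokenizer(s, truncation=True, max_length=128) for s in corpus]

model.train()
batch = collator(features)   # randomly masks 15% of tokens (the cloze objective)
loss = model(**batch).loss   # cross-entropy computed on the masked positions only
loss.backward()
optimizer.step()
```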

Related Work
Experimental Setup
Masked Language Model Pretraining
Statistical Analysis
Findings
Discussion