Abstract
Language model pretraining is an effective method for improving performance on downstream natural language processing tasks. Although language modeling is unsupervised, which makes data collection relatively inexpensive, it remains challenging for languages with limited resources. This results in a large technological disparity between high- and low-resource languages across numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. We achieve this by formulating language modeling of low-resource languages as a domain adaptation task, using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at the sequence level. To evaluate our method, we post-train a RoBERTa model pretrained on English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
Highlights
Bidirectional Encoder Representations from Transformers (BERT) [1] is a Transformer network [2] pretrained with a language modeling objective and a vast amount of raw text.
BERTology has become one of the most influential and active research areas in Natural Language Processing (NLP). This led to the development of many improved architectures and training methodologies for Pretrained Language Models (PLMs), such as RoBERTa [3], ALBERT [4], BART [5], ELECTRA [6], and GPT [7,8], improving various NLP systems and even achieving superhuman performance [1,9,10].
Evaluating our method on four challenging Korean natural language understanding tasks, we find that cross-lingual post-training is extremely effective at increasing data efficiency.
Summary
Bidirectional Encoder Representations from Transformers (BERT) [1] is a Transformer network [2] pretrained with a language modeling objective and a vast amount of raw text. BERTology has become one of the most influential and active research areas in Natural Language Processing (NLP). This led to the development of many improved architectures and training methodologies for Pretrained Language Models (PLMs), such as RoBERTa [3], ALBERT [4], BART [5], ELECTRA [6], and GPT [7,8], improving various NLP systems and even achieving superhuman performance [1,9,10]. More than half of all documents are from these 10 languages, which indicates that language resource availability remains highly imbalanced.