Improving Pre-Trained Multilingual Model with Vocabulary Expansion

Hai Wang, Kai Sun, Jianshu Chen, Dong Yu, Dian Yu

doi:10.18653/v1/k19-1030

Open Access

https://doi.org/10.18653/v1/k19-1030

Publication Date: Jan 1, 2019
Citations: 15
License type: CC BY

Abstract

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on the pre-trained multilingual BERT model for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that mixture mapping is the more promising of the two. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.

Highlights

It has been shown that performance on many natural language processing tasks drops dramatically on held-out data when a significant percentage of words do not appear in the training data, i.e., out-of-vocabulary (OOV) words (Søgaard and Johannsen, 2012; Madhyastha et al., 2016).
In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches based on a pre-trained multilingual model, Bidirectional Encoder Representations from Transformers (BERT), for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension.
Due to the expensive computation of softmax (Yang et al., 2017) and data imbalance across different languages, the vocabulary size for each language in a multilingual model is relatively small compared to the monolingual BERT/Generative Pre-Training (GPT) models, especially for low-resource languages.
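To make the notion of "a significant percentage of words do not appear in the training data" concrete, the following is a minimal sketch of how an OOV rate could be measured. The function name `oov_rate` and the toy sentences are assumptions made for illustration; the paper's own tokenization and statistics are not reproduced here.

```python
def oov_rate(train_tokens, heldout_tokens):
    """Fraction of held-out tokens whose type never appears in the training data."""
    vocab = set(train_tokens)
    if not heldout_tokens:
        return 0.0
    unseen = sum(1 for tok in heldout_tokens if tok not in vocab)
    return unseen / len(heldout_tokens)


# Toy data (hypothetical): whitespace tokenization only.
train = "the cat sat on the mat".split()
heldout = "the dog sat on the rug".split()

rate = oov_rate(train, heldout)
print(f"OOV rate: {rate:.2f}")
```

Here two of the six held-out tokens ("dog" and "rug") are unseen in training, so the rate is one third. A subword vocabulary such as BERT's would split unseen words into known pieces, so a token-level count like this only approximates the problem the paper studies.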

Summary

Introduction

It has been shown that performance on many natural language processing tasks drops dramatically on held-out data when a significant percentage of words do not appear in the training data. Instead of pre-training many monolingual models like the existing English GPT, English BERT, and Chinese BERT, a more natural choice is to develop a powerful multilingual model such as the multilingual BERT. All of these pre-trained models rely on language modeling, where a common trick is to tie the weights of the softmax layer and the word embeddings (Press and Wolf, 2017). To address the OOV problem, instead of pre-training a deep model with a large vocabulary, we aim to enlarge the vocabulary when we fine-tune a pre-trained multilingual model on downstream tasks.
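The weight-tying trick mentioned above can be illustrated with a small sketch in plain Python. It assumes a toy model in which one matrix serves both as the input embedding table and as the softmax weights, so adding a row enlarges the vocabulary on both sides at once. The class `TiedEmbedding`, its `add_token` method, and the example vectors are hypothetical; in particular, this is not the paper's joint mapping or mixture mapping, which determine how new embeddings are obtained.

```python
import math


class TiedEmbedding:
    """Toy head whose input embeddings and softmax weights are one shared matrix."""

    def __init__(self, vocab, dim):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        # One row per token; small deterministic values keep the example reproducible.
        self.weight = [
            [0.01 * (i + j + 1) for j in range(dim)] for i in range(len(vocab))
        ]

    def embed(self, token):
        # Input side: look up the row for this token.
        return self.weight[self.token_to_id[token]]

    def probs(self, hidden):
        # Output side: logits are dot products with the same rows, then softmax.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def add_token(self, token, vector):
        # Hypothetical expansion step: one new row extends input and output vocab.
        self.token_to_id[token] = len(self.weight)
        self.weight.append(list(vector))


model = TiedEmbedding(["[UNK]", "the", "cat"], dim=4)
model.add_token("Katze", [0.05, 0.05, 0.05, 0.05])  # vector chosen arbitrarily
distribution = model.probs([1.0, 0.0, 0.0, 0.0])
print(len(distribution))
```

Because the matrix is shared, the cost of the softmax grows with every added row, which is consistent with the motivation for keeping the pre-training vocabulary small and expanding it only at fine-tuning time.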

Approach

Pre-Trained BERT

Vocabulary Expansion

Experiment Settings

Discussions about Mapping Methods

Monolingual Sequence Labeling Tasks

Code-Mixed Sequence Labeling Tasks

Sequence Classification Tasks

Discussions

Related Work

Monolingual Setting

Findings

Conclusion

Similar Papers

IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
Divyanshu Kakwani ... Satish Golla
01 Jan 2020

Effectiveness of Pre-Trained Language Models for the Japanese Winograd Schema Challenge
Keigo Takahashi ... Teruaki Oka
Journal of Advanced Computational Intelligence and Intelligent Informatics | VOL. 27
20 May 2023

Are Multilingual Models Effective in Code-Switching?
Genta Indra Winata ... Andrea Madotto
01 Jan 2020

Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation
Zhiqi Huang ... Puxuan Yu
27 Feb 2023
