Abstract
Neural machine translation (NMT) is often described as ‘data hungry’ because it typically requires large amounts of parallel data in order to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource. The situation becomes even worse when translation in a specialised domain is considered. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and bidirectional encoder representations from Transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and increasing the presence of rare words on both sides of the original parallel training corpus. Our experiments on simulated low-resource German–English and French–English translation tasks show that the proposed data augmentation strategy can significantly improve state-of-the-art NMT systems and outperform the state-of-the-art data augmentation approach for low-resource NMT.
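To make the core augmentation idea concrete, the following is a minimal sketch of masked-LM-based rare-word substitution using the Hugging Face `transformers` library. It illustrates only the general technique the abstract describes, not the paper's exact algorithm: the model name, the candidate filter, and the example sentence are all assumptions, and the paper's method additionally uses BWEs to keep the two sides of the parallel corpus consistent.

```python
from transformers import pipeline

# Illustrative sketch (not the paper's exact setup): use a masked LM to
# propose in-context substitutes for a rare word, yielding new synthetic
# training sentences. Model choice and filtering are assumptions.
unmasker = pipeline("fill-mask", model="bert-base-cased")

sentence = "The committee reached a unanimous verdict."
rare_word = "verdict"

# Replace the rare word with the model's mask token and query the LM.
masked = sentence.replace(rare_word, unmasker.tokenizer.mask_token, 1)
candidates = unmasker(masked, top_k=5)

# Keep plausible alphabetic substitutes; each one gives an augmented sentence.
augmented = [
    masked.replace(unmasker.tokenizer.mask_token, c["token_str"])
    for c in candidates
    if c["token_str"].isalpha() and c["token_str"].lower() != rare_word
]
for s in augmented:
    print(s)
```

In the full method, each such source-side substitution would be mirrored on the target side (e.g. via a BWE-induced translation of the substitute) so that the augmented sentence pair remains parallel.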