Low-resource Machine Translation Research Articles

The quality of data-driven Machine Translation (MT) depends strongly on both the quantity and the quality of the training data. However, collecting a large set of parallel training texts is not easy in practice. Although various approaches have been proposed to overcome this issue, the lack of large parallel corpora still poses a major practical problem for many language pairs. Since monolingual data plays an important role in boosting fluency for Neural MT (NMT) models, this paper investigates and compares two learning-based translation approaches for Spanish-Turkish translation, a low-resource setting in which only large monolingual datasets are available in each language: 1) the Unsupervised Learning approach, and 2) the Round-Tripping approach. Both approaches remove the need for bilingual data and make it possible to train the NMT system on monolingual data only. We use an attention-based NMT (Attentional NMT) model, which leverages a careful initialization of the parameters, the denoising effect of language models, and the automatic generation of bilingual data. Our experimental results show that the Unsupervised Learning approach outperforms the Round-Tripping approach in both the Spanish-to-Turkish and Turkish-to-Spanish directions. These results confirm that the Unsupervised Learning approach is still a reliable learning-based translation technique for Spanish-Turkish low-resource NMT.

Neural machine translation has recently achieved state-of-the-art translation quality for many language pairs. However, it has been less tested for the English-Bangla language pair, two linguistically distant and widely spoken languages. In this paper, we apply neural machine translation to English-Bangla translation in both directions and compare it against a standard phrase-based statistical machine translation system. We obtain up to +0.30 and +4.95 BLEU improvement over phrase-based statistical machine translation for English-to-Bangla and Bangla-to-English respectively. Because Bangla is low-resource and morphologically rich, the English-Bangla translation task produces a large number of rare words. We apply subword segmentation with byte pair encoding to handle this rare-word issue. We obtain up to +0.69 and +0.30 BLEU improvement over baseline neural machine translation for English-to-Bangla and Bangla-to-English respectively. We further investigate our system output for several challenging linguistic properties such as subject-verb agreement, noun inflection, long-distance reordering, and rare-word translation. We observe that neural machine translation, with and without subword segmentation, significantly outperforms the phrase-based statistical machine translation system, thus establishing itself as the state-of-the-art technology for low-resource English-Bangla machine translation.
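The subword segmentation step mentioned in this abstract can be illustrated with a minimal sketch of byte pair encoding. This is not the authors' implementation: the toy word-frequency table, the number of merges, the `</w>` end-of-word marker, and the function names `merge_pair`, `learn_bpe`, and `segment` are all assumptions made for illustration. The idea is to repeatedly merge the most frequent adjacent symbol pair, so that a rare word can later be split into more frequent subword units.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> frequency table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus statistics (hypothetical, for illustration only).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=4)

# "lowest" never occurs in the toy corpus, yet it can be split into known units.
print(segment("lowest", merges))
```

In practice a production system would typically rely on an established subword tool rather than a hand-written learner; the sketch only shows why an unseen word can still be represented with units learned from frequent words.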

Articles published on Low-resource Machine Translation

Surprise Language Challenge: Developing a Neural Machine Translation System between Pashto and English in Two Months

The Usefulness of Bibles in Low-Resource Machine Translation

Exploiting Translation Model for Parallel Corpus Mining

Multilingual Denoising Pre-training for Neural Machine Translation

Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings

Revisiting Back-Translation for Low-Resource Machine Translation Between Chinese and Vietnamese

Spanish-Turkish Low-Resource Machine Translation: Unsupervised Learning vs Round-Tripping

Neural Machine Translation for Low-resource English-Bangla

Neighbors helping the poor: improving low-resource machine translation using related languages

BBN’s low-resource machine translation for the LoReHLT 2016 evaluation
