Abstract

Out-of-vocabulary (OOV) words pose serious challenges for machine translation (MT) tasks, in particular for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel corpora exist. Our work adapts variants of seq2seq models to transduce such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi–Bhojpuri words. We demonstrate that our models can be used effectively for language pairs that have limited parallel corpora; our models work at the character level to capture phonetic and orthographic similarities across multiple types of word adaptation, whether synchronic or diachronic, loanwords or cognates. We describe the training aspects of several character-level NMT systems that we adapted to this task and characterize their typical errors. Our method improves the BLEU score by 6.3 on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions generalize well to other languages by applying them successfully to Hindi–Bangla cognate pairs. Our work can be seen as an important step in the process of: (i) resolving the OOV-word problem arising in MT tasks; (ii) creating effective parallel corpora for resource-constrained languages; and (iii) leveraging the enhanced semantic knowledge captured by word-level embeddings to perform character-level tasks.

Highlights

  • We borrow ideas from previous approaches that have used cognates. Simard et al. (1993) use cognates to align sentences in a parallel corpus and report 97.6% accuracy on the alignments obtained, compared to reference alignments. Mann and Yarowsky (2001) use cognates extracted based on edit distances to induce translation lexicons via transduction models. Scannell (2006) presents a detailed study on translation of a closely related language pair, Irish–Scottish Gaelic.

  • For the alignment model (AM) and hierarchical attention network (HAN) models, we consider various parameters while training, such as LSTM/GRU encoding/decoding units, sequence chunking and batch sizes, optimization methods, and regularization, and we report that transduction performance varies widely depending on the combination of parameters used.

  • We report the accuracy of each experiment using the BLEU score and a Levenshtein-distance-based string similarity (SS) measure, as in Equation 3. After obtaining the optimal hyperparameter set for AM and HAN, we compare the word accuracy (WA, Equation 4), defined as the percentage of correctly translated words, for all models including the state of the art (SOTA).
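As an illustration of the two evaluation measures named above, the following sketch computes a Levenshtein-distance-based string similarity and a word-accuracy percentage. The paper's exact Equations 3 and 4 are not reproduced on this page, so the normalization used here (dividing the edit distance by the longer word's length) is an assumption, and all function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max length, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def word_accuracy(predictions, references) -> float:
    """Percentage of words transduced exactly right."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

Word accuracy is the stricter of the two: a prediction one character off scores zero on WA but still earns partial credit under SS.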


Summary

Introduction

We borrow ideas from previous approaches that have used cognates. Simard et al. (1993) use cognates to align sentences in a parallel corpus and report 97.6% accuracy on the alignments obtained, compared to reference alignments. Mann and Yarowsky (2001) use cognates extracted based on edit distances to induce translation lexicons via transduction models. Scannell (2006) presents a detailed study on translation of a closely related language pair, Irish–Scottish Gaelic. They learn transfer rules based on the alignment of cognate pairs and use these rules to generate transductions of new words. They use a fine-grained cognate extraction method: first editing Scottish words to 'seem like' Gaelic words, then applying edit-based string similarity to the new word pairs and keeping only close pairs, with the additional constraint that both words in a pair should share a common English translation. Since we use linguistic experts to extract cognates from our dataset, we do not need to encode string similarity measures explicitly to extract cognates.
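The cognate-filtering idea described above, keeping a candidate pair only when the two words are close under an edit-based similarity and also share a common English translation, can be sketched as follows. The similarity threshold, the use of `difflib.SequenceMatcher` as the similarity measure, and all function names are illustrative assumptions, not details taken from Scannell (2006).

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    # Edit-based similarity stand-in; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def filter_cognates(candidates, glosses_a, glosses_b, threshold: float = 0.7):
    """Keep (word_a, word_b) pairs that are string-similar AND share
    at least one English gloss.

    candidates: iterable of (word_a, word_b) pairs
    glosses_a, glosses_b: dicts mapping a word to its set of English glosses
    """
    kept = []
    for wa, wb in candidates:
        shares_gloss = bool(glosses_a.get(wa, set()) & glosses_b.get(wb, set()))
        if shares_gloss and similar(wa, wb, threshold):
            kept.append((wa, wb))
    return kept
```

The shared-gloss constraint prunes chance string matches, while the similarity threshold prunes translation pairs that are semantically linked but not cognate-like in form.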

