Neural machine translation has outperformed earlier machine translation approaches, primarily for high-resource languages. However, very little work has been reported for resource-constrained languages such as Khasi. The Khasi language belongs to the Mon-Khmer branch of the Austroasiatic language family and is spoken primarily in the state of Meghalaya in India. To address the difficulty of neural machine translation for the under-resourced Khasi language, we build a substantial English–Khasi parallel corpus. We prepare the data in three forms for our experiments: untokenized, tokenized, and subword-segmented with BPE (Byte Pair Encoding). Experiments are carried out on this dataset with neural machine translation systems built on LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and Transformer-based architectures for the English–Khasi language pair. We also carry out transfer learning experiments using English–Vietnamese as the parent language pair and English–Khasi as the child language pair. This work reports a quantitative and qualitative analysis of the models across architectures and data segmentation methods. The experimental findings show that the model adapted through transfer learning achieves a reasonable improvement in BLEU, with best scores of 58.1 BLEU on similar domains and 17.7 BLEU on the general domain, outscoring the other models for the same language pair. The qualitative analysis focuses on the morphological inflections that mark gender in the translated output.