Abstract

Neural machine translation has largely outperformed earlier machine translation approaches, but mainly for high-resource languages. Very little work has been reported for resource-constrained languages such as Khasi. The Khasi language belongs to the Mon-Khmer branch of the Austroasiatic language family and is spoken primarily in the state of Meghalaya in India. Because the scarcity of data makes neural machine translation for Khasi difficult, we build a substantial English–Khasi parallel corpus. We prepare the datasets for our experiments with three segmentation methods: untokenized, tokenized, and subword BPE (Byte Pair Encoding). On this corpus we evaluate several neural machine translation systems for the English–Khasi language pair, based on LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and transformer architectures. We also apply transfer learning, using English–Vietnamese as the parent language pair and English–Khasi as the child language pair. This work reports a quantitative and qualitative analysis of the models across architectures and data segmentation methods. The experimental findings show that the model adapted through transfer learning outscores the other models for the same language pair, with a highest BLEU score of 58.1 on similar domains and 17.7 on the general domain. The qualitative analysis focuses on the morphological inflections of gender identification in the translated output.
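To illustrate the subword BPE segmentation mentioned above, the following is a minimal sketch of the generic byte pair encoding merge procedure. It is not the authors' pipeline: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy English word list are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy sketch)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption
        # made here only to keep the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy data, not drawn from the English–Khasi corpus.
toy_words = ["low", "low", "lower", "lowest"]
toy_merges = learn_bpe(toy_words, 2)
print(toy_merges)
print(segment("lowly", toy_merges))
```

With only two merges, a frequent stem such as `low` becomes a single subword while rarer material stays split into characters, which is how BPE lets a model handle words it has not seen in training.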
