Abstract

Over the past ten years, neural machine translation (NMT) has seen tremendous growth and is now entering a phase of maturity. Despite being the most popular solution for machine translation (MT), it performs sub-optimally on under-resourced language pairs because parallel corpora are scarce compared with high-resourced pairs. Applying NMT techniques to under-resourced language pairs has therefore attracted considerable research attention. In view of this growth of MT, this paper describes a set of practical approaches for investigating MT between Kannada and Tulu. Both languages belong to the Dravidian family and are under-resourced, lacking tools and resources, particularly parallel corpora for MT. Since no parallel corpus exists for the Kannada-Tulu language pair, this work constructs one. As manual construction of a parallel corpus is laborious, data augmentation, together with suitable preprocessing techniques, is used to enlarge it. Several NMT schemes, namely a recurrent neural network (RNN) baseline, a bidirectional recurrent neural network (BiRNN), transformer-based NMT with and without subword tokenization, and statistical machine translation (SMT) models, are implemented for Kannada-Tulu and Tulu-Kannada MT. Empirical results reveal that data augmentation increases the bilingual evaluation understudy (BLEU) scores of the proposed models. The transformer-based models with subword tokenization outperform the others, with BLEU scores of 41.82 and 40.91 for Kannada-Tulu and Tulu-Kannada MT, respectively.
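
The best results above come from transformer models with subword tokenization. The abstract does not name the tokenizer or its configuration, so the following is a minimal Python sketch of one common approach, byte-pair encoding (BPE) subword segmentation with the SentencePiece library; the file names, vocabulary size, and other settings are illustrative assumptions rather than the paper's reported setup.

import sentencepiece as spm

# Train a BPE subword model on the training text (one sentence per line).
# "kn_tu_train.txt" is a hypothetical file; in practice the source- and
# target-side text can be modeled jointly or separately.
spm.SentencePieceTrainer.train(
    input="kn_tu_train.txt",
    model_prefix="kn_tu_bpe",
    model_type="bpe",
    vocab_size=8000,         # assumed value; tuned to corpus size in practice
    character_coverage=1.0,  # keep full coverage of the Kannada script
)

# Segment sentences into subword units; the NMT model is trained on these
# units instead of whole words.
sp = spm.SentencePieceProcessor(model_file="kn_tu_bpe.model")
pieces = sp.encode("ಕನ್ನಡ ಮತ್ತು ತುಳು", out_type=str)  # list of subword strings
ids = sp.encode("ಕನ್ನಡ ಮತ್ತು ತುಳು", out_type=int)      # list of vocabulary ids
print(pieces)
print(ids)

Splitting rare words into shared subword units keeps the vocabulary small and lets the model compose unseen word forms, which helps morphologically rich, low-resource languages such as Kannada and Tulu and is consistent with the BLEU gains the abstract reports for the subword-tokenized transformers.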

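The reported scores are corpus-level BLEU. Since the abstract does not say which BLEU implementation was used, the sketch below computes BLEU with the sacreBLEU library as one standard, assumed choice; the hypothesis and reference sentences are hypothetical placeholders, not data from the paper.

import sacrebleu

# Hypothetical system outputs and reference translations; each reference
# stream must be parallel to the list of hypotheses.
hypotheses = ["ಇದು ಒಂದು ಸಣ್ಣ ವಾಕ್ಯ"]
references = [["ಇದು ಒಂದು ಚಿಕ್ಕ ವಾಕ್ಯ"]]

# Corpus-level BLEU over all sentence pairs.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")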