Abstract

Phrase-based statistical machine translation (PB-SMT) was the dominant paradigm in machine translation (MT) research for more than two decades. Over the past four to five years, deep neural MT models have been producing state-of-the-art performance across many translation tasks; in other words, neural MT (NMT) has taken the place of PB-SMT and now represents the state of the art in MT research. Translation to or from under-resourced languages has historically been seen as a challenging task. Despite producing state-of-the-art results in many translation tasks, NMT still performs poorly for many low-resource language pairs, mainly because of the data-demanding nature of its learning task. MT researchers have tried to address this problem via various techniques, e.g., exploiting source- and/or target-side monolingual data for training, augmenting bilingual training data, and transfer learning. Despite some success, none of the present-day approaches has entirely overcome the problem of translation in low-resource scenarios for many languages. In this work, we investigate the performance of PB-SMT and NMT on two rarely tested under-resourced language pairs, English-to-Tamil and Hindi-to-Tamil, taking a specialised data domain into consideration. We present our findings, including the rankings of our MT systems produced via a social media-based human evaluation scheme.

Highlights

  • In recent years, machine translation (MT) researchers have proposed approaches to counter the data sparsity problem and to improve the performance of neural MT (NMT) systems in low-resource scenarios, e.g., augmenting training data from source and/or target monolingual corpora [1,2], unsupervised learning strategies in the absence of labelled data [3,4], exploiting training data involving other languages [5,6], multi-task learning [7], the selection of hyperparameters [8], and pre-trained language model fine-tuning [9]

  • We present the comparative performance of the phrase-based statistical machine translation (PB-SMT) and NMT systems in terms of the widely used automatic evaluation metric BLEU (see the scoring sketch after this list)

  • When we looked at the second position in the rankings, we saw that NMT was the winner with PB-SMT not far behind, and the same was true for PB-SMT and GT
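
As a hedged illustration of the BLEU comparison mentioned above, the following minimal sketch uses the sacreBLEU library, assuming one detokenised hypothesis file per system and a shared reference file; the file names are hypothetical placeholders, since the paper does not state its exact evaluation tooling.

```python
# Minimal sketch of a corpus-level BLEU comparison with sacreBLEU.
# File names are hypothetical placeholders for the systems' outputs.
import sacrebleu

def read_lines(path):
    """Read one sentence per line from a plain-text file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

refs = read_lines("test.ref.ta")           # Tamil reference translations
systems = {
    "PB-SMT": read_lines("pbsmt.hyp.ta"),  # PB-SMT system output
    "NMT": read_lines("nmt.hyp.ta"),       # NMT system output
}

for name, hyps in systems.items():
    # corpus_bleu takes the hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    print(f"{name}: BLEU = {bleu.score:.2f}")
```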


Introduction

Machine translation (MT) researchers have proposed approaches to counter the data sparsity problem and to improve the performance of neural MT (NMT) systems in low-resource scenarios, e.g., augmenting training data from source and/or target monolingual corpora [1,2], unsupervised learning strategies in the absence of labelled data [3,4], exploiting training data involving other languages [5,6], multi-task learning [7], the selection of hyperparameters [8], and pre-trained language model fine-tuning [9]. However, the back-translation strategy of Sennrich et al. [1] is less effective in low-resource settings, where it is hard to train a good back-translation model [10]; unsupervised MT does not work well for distant languages [11] due to the difficulty of training unsupervised cross-lingual word embeddings for such languages [12]; and the same is applicable in the case of transfer learning. This line of research needs more attention from the MT research community. In this context, we refer interested readers to some of the papers [14,15] that compared phrase-based statistical machine translation (PB-SMT) and NMT on a variety of use cases. As for low-resource scenarios, as mentioned above, many studies (e.g., Koehn and Knowles [16], Östling and Tiedemann [17], Dowling et al. [18]) found that PB-SMT can outperform NMT.
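
To make the back-translation idea of Sennrich et al. [1] concrete, here is a minimal sketch of the data-augmentation step; `reverse_model.translate` is a hypothetical stand-in for whatever target-to-source model a given toolkit provides, not an API from the paper.

```python
# Sketch of back-translation-based data augmentation (after Sennrich et al. [1]).
# `reverse_model` is a hypothetical target-to-source MT model exposing a
# `translate(sentence) -> sentence` method.

def back_translate(monolingual_tgt, reverse_model, bitext):
    """Append synthetic (source, target) pairs to the genuine bitext."""
    synthetic = []
    for tgt_sentence in monolingual_tgt:
        # Translate target-language text back into the source language.
        synthetic_src = reverse_model.translate(tgt_sentence)
        synthetic.append((synthetic_src, tgt_sentence))
    # The forward (source-to-target) model is then trained on the union
    # of genuine and synthetic parallel data.
    return bitext + synthetic
```

As noted above, the quality of such synthetic data hinges on the reverse model, which is precisely what is hard to obtain in low-resource settings [10].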
