Abstract

This paper investigates the effect of corpus augmentation on the quality of English-Amharic Machine Translation (MT), with the goal of improving the translation quality of Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for this language pair. For this investigation, we built SMT systems with tri-gram and four-gram language models, as well as NMT models based on Gated Recurrent Units (GRU) and Recurrent Neural Networks (RNN) with an attention mechanism. To observe how corpus augmentation affects the translation quality of these models, we trained each of them separately on our original corpus and on the augmented one. The original and augmented corpora contain 225,304 and 450,608 English-Amharic parallel sentences, respectively. The augmentation was performed offline with a token-level augmentation technique, applied before any other MT processing began. Among the several token-level augmentation mechanisms available, random insertion, replacement, deletion, and swapping were chosen and implemented. After the models had been trained, their Bilingual Evaluation Understudy (BLEU) scores were collected and analyzed. The results show that the models trained on the augmented corpus outperform their counterparts trained on the original corpus in terms of BLEU score. We conclude that corpus augmentation improved the performance of both the SMT and NMT systems.
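
The four token-level operations named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes whitespace tokenization, that only the source side of each pair is perturbed, that inserted and replacement tokens are drawn uniformly from a supplied vocabulary, and that one augmented copy is produced per original pair (which is consistent with the augmented corpus being exactly twice the size of the original: 2 × 225,304 = 450,608). All function names and parameters are hypothetical.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, rng):
    """Exchange the tokens at two distinct random positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_insertion(tokens, vocab, rng):
    """Insert one vocabulary token at a random position."""
    out = list(tokens)
    out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out


def random_replacement(tokens, vocab, rng):
    """Overwrite one randomly chosen token with a vocabulary token."""
    out = list(tokens)
    if out:
        out[rng.randrange(len(out))] = rng.choice(vocab)
    return out


def augment_corpus(pairs, vocab, seed=0, p_delete=0.1):
    """Return the original pairs followed by one perturbed copy of each.

    Assumption: only the source sentence is perturbed and the target
    sentence is copied unchanged. The abstract does not specify this.
    """
    rng = random.Random(seed)
    ops = [
        lambda t: random_insertion(t, vocab, rng),
        lambda t: random_replacement(t, vocab, rng),
        lambda t: random_deletion(t, p_delete, rng),
        lambda t: random_swap(t, rng),
    ]
    augmented = list(pairs)
    for src, tgt in pairs:
        op = rng.choice(ops)  # pick one of the four operations at random
        augmented.append((" ".join(op(src.split())), tgt))
    return augmented
```

Because the augmentation runs offline, a function such as `augment_corpus` would be applied once to the parallel corpus, and the resulting file would then be fed to the SMT and NMT training pipelines unchanged.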
