Abstract

The Transformer-based neural machine translation (NMT) model has been very successful in recent years and has become the new mainstream approach. However, applying it to low-resourced languages requires large amounts of data and efficient model configuration (hyper-parameter tuning) mechanisms. The scarcity of parallel texts is a bottleneck for high-quality (N)MT, especially for under-resourced languages such as Amharic. This paper therefore presents an attempt to improve English-Amharic MT by introducing three different vanilla Transformer architectures with different hyper-parameter values. To obtain additional training material, offline token-level corpus augmentation was applied to a previously collected English-Amharic parallel corpus. Compared to previous work on Amharic MT, the best of the three Transformer models achieves state-of-the-art BLEU scores. This result was obtained by combining corpus augmentation techniques with hyper-parameter tuning.
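The abstract names offline token-level corpus augmentation but does not spell out the operations used. The following is a minimal, hypothetical Python sketch of one common token-level scheme (random token dropout and adjacent-token swaps on the source side of a parallel pair); the function name, parameters, and example sentences are illustrative assumptions, not the authors' implementation.

    import random

    def augment_pair(src_tokens, tgt_tokens, p_drop=0.1, p_swap=0.1, seed=None):
        """Hypothetical token-level augmentation for one parallel sentence pair.

        Applies random token dropout and adjacent-token swaps to the source
        side only, leaving the target untouched. A generic sketch of offline
        token-level augmentation, not the paper's exact scheme.
        """
        rng = random.Random(seed)
        kept = []
        for tok in src_tokens:
            # drop a token with probability p_drop, but never empty the sentence
            if len(src_tokens) > 1 and rng.random() < p_drop:
                continue
            kept.append(tok)
        # swap adjacent tokens with probability p_swap
        for i in range(len(kept) - 1):
            if rng.random() < p_swap:
                kept[i], kept[i + 1] = kept[i + 1], kept[i]
        return kept, list(tgt_tokens)

    # Example: create an augmented copy of a pair offline and append it to the corpus
    src = "the cat sat on the mat".split()
    tgt = "ድመቷ ምንጣፉ ላይ ተቀመጠች".split()
    aug_src, aug_tgt = augment_pair(src, tgt, seed=42)
    print(aug_src, aug_tgt)

Because the augmentation is applied offline, each augmented pair is simply written back alongside the original corpus before training, so the NMT training pipeline itself is unchanged.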
