Abstract

Language and culture preservation is a serious challenge, both socially and technologically. This paper proposes a novel data-augmentation approach to machine translation of low-resource languages. Because low-resource languages such as Aymara and Quechua have few available translations that machine learning software can use as reference, machine translation models frequently err when translating to and from these languages. Models learn the syntactic and lexical patterns underlying translations by processing training data, so an insufficient amount of data prevents them from producing accurate translations. In this paper, I propose the novel application of a generative adversarial network (GAN) to automatically augment low-resource language data. A GAN consists of two competing models: one learns to generate sentences from noise, and the other learns to tell whether a given sentence is real or generated. My experiments show that even when trained on a very small amount of language data (< 20,000 sentences) in a simulated low-resource setting, such a model can generate original, coherent sentences, such as "ask me that healthy lunch im cooking up" and "my grandfather work harder than your grandfather before." The first of its kind, this application of a GAN is effective in augmenting low-resource language data to improve the accuracy of machine translation and provides a reference for future experimentation with GANs in machine translation.
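
To make the two competing models concrete, the sketch below shows one adversarial training step in PyTorch. The abstract does not specify the paper's architecture, so the vocabulary size, sequence length, layer sizes, and optimizer settings here are illustrative assumptions, not the author's actual model.

```python
# Minimal GAN sketch for sentence generation: a generator maps noise to token
# distributions, a discriminator scores sentences as real vs. generated.
# All sizes below are hypothetical; the paper does not report its configuration.
import torch
import torch.nn as nn

VOCAB_SIZE = 2000   # assumed vocabulary size
SEQ_LEN = 20        # assumed maximum sentence length
NOISE_DIM = 64      # assumed noise-vector dimension
BATCH = 32

class Generator(nn.Module):
    """Maps a noise vector to a sequence of soft token distributions (a 'generated' sentence)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 256), nn.ReLU(),
            nn.Linear(256, SEQ_LEN * VOCAB_SIZE),
        )
    def forward(self, z):
        logits = self.net(z).view(-1, SEQ_LEN, VOCAB_SIZE)
        return logits.softmax(dim=-1)  # soft tokens keep the generator differentiable

class Discriminator(nn.Module):
    """Scores a sentence (as token distributions) with one logit: real (high) vs. generated (low)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(SEQ_LEN * VOCAB_SIZE, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )
    def forward(self, x):
        return self.net(x)

gen, disc = Generator(), Discriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

# Stand-in batch of "real" sentences as one-hot token sequences (placeholder data).
real = torch.zeros(BATCH, SEQ_LEN, VOCAB_SIZE).scatter_(
    2, torch.randint(VOCAB_SIZE, (BATCH, SEQ_LEN, 1)), 1.0)
noise = torch.randn(BATCH, NOISE_DIM)

# Discriminator step: learn to separate real sentences from generated ones.
fake = gen(noise).detach()
d_loss = (loss_fn(disc(real), torch.ones(BATCH, 1)) +
          loss_fn(disc(fake), torch.zeros(BATCH, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: learn to produce sentences the discriminator labels as real.
g_loss = loss_fn(disc(gen(noise)), torch.ones(BATCH, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In practice, the generated sentences judged "real" by a trained discriminator would be added to the low-resource corpus as augmented training data for the translation model.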
