Abstract

A complete absence of parallel data affects a large number of language pairs and can severely degrade machine translation quality. We describe a language-independent method to enable machine translation between a low-resource language (LRL) and a third language, e.g. English. We address cases of LRLs for which no parallel data are readily available between the LRL and any other language, but ample training data exist between a closely related high-resource language (HRL) and the third language. We take advantage of the similarities between the HRL and the LRL to transform the HRL data into data resembling the LRL via transliteration. The transliteration models are trained on transliteration pairs extracted from Wikipedia article titles. We then automatically back-translate monolingual LRL data with the models trained on the transliterated HRL data and use the resulting parallel corpus to train our final models. Our method achieves significant improvements in translation quality, approaching the results of a general-purpose neural machine translation system trained on a substantial amount of parallel data. Moreover, the method does not rely on the existence of any parallel data for training, but instead bootstraps existing resources in a related language.
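For illustration, the following is a minimal, self-contained sketch of the transliteration step. The toy Russian–Belarusian title pairs and the naive equal-length character mapping are assumptions for exposition only; the actual models described here would typically be proper character-level transliteration systems trained on far more Wikipedia title pairs.

    from collections import Counter, defaultdict

    def learn_char_map(title_pairs):
        # Learn a naive one-to-one character mapping from HRL->LRL
        # title pairs of equal length; a toy stand-in for a real
        # transliteration model trained on Wikipedia article titles.
        counts = defaultdict(Counter)
        for hrl, lrl in title_pairs:
            if len(hrl) == len(lrl):  # keep only trivially alignable pairs
                for h, l in zip(hrl, lrl):
                    counts[h][l] += 1
        # For each HRL character, keep its most frequent LRL counterpart.
        return {h: c.most_common(1)[0][0] for h, c in counts.items()}

    def transliterate(text, char_map):
        # Map HRL text to pseudo-LRL text character by character,
        # passing through characters never seen in the title pairs.
        return "".join(char_map.get(ch, ch) for ch in text)

    # Toy Russian->Belarusian title pairs (hypothetical training data).
    pairs = [("молоко", "малако"), ("город", "горад")]
    char_map = learn_char_map(pairs)
    # The Russian side of a Russian-English parallel corpus would be
    # transformed like this to obtain pseudo-Belarusian training data.
    print(transliterate("город молоко", char_map))  # -> "гарад малака"

In this sketch the HRL side of the parallel corpus is rewritten character by character, so the English side can be reused unchanged, which is the property the method exploits.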

Highlights

  • The following sections describe the data, the transliteration method employed to transform the data from the high-resource language (HRL) into the low-resource language (LRL), and the neural machine translation (MT) system workflow used in our experiments

  • This section presents the results achieved in the Belarusian ↔ English experiments using Russian. It is divided into two parts: first, we explore the efficiency of the transliteration method in a neural machine translation (NMT) application (System 1), and second, we experiment with back-translating monolingual LRL data with System 1 and using the resulting parallel corpus to train our final models (System 2); a sketch of this step follows the list

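As a companion sketch, the back-translation step from System 1 to System 2 might look like the following. Here `system1_translate` is a hypothetical stub standing in for the trained System 1 model, and the toy Belarusian input is illustrative only.

    def system1_translate(lrl_sentence):
        # Hypothetical stub: in practice this would query the
        # LRL->English NMT model trained on the transliterated
        # HRL data (System 1).
        return "<english translation of: %s>" % lrl_sentence

    def build_synthetic_corpus(mono_lrl_lines):
        # Back-translate monolingual LRL sentences into English and
        # pair each machine-translated English sentence (synthetic
        # source) with the original LRL sentence (genuine target).
        # Training on these pairs yields the final English->LRL
        # model (System 2).
        corpus = []
        for line in mono_lrl_lines:
            lrl = line.strip()
            if lrl:
                corpus.append((system1_translate(lrl), lrl))
        return corpus

    # Illustrative usage with toy monolingual Belarusian input.
    for en, be in build_synthetic_corpus(["добры дзень", "дзякуй"]):
        print(en, "|||", be)

Keeping the genuine LRL text on the target side is the usual design choice in back-translation: the model learns to produce fluent LRL output even though its source side is machine-generated.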

Introduction

Human languages contribute significantly to the cultural and linguistic heritage of humankind. Natural language processing (NLP) can play a significant role in the survival and further development of all languages by offering state-of-the-art tools and applications to their speakers. MT can contribute to the rapid spread of vital information, such as in a crisis or emergency. A notable example is the recent refugee crisis, where information has to be transferred rapidly from a large number of Asian languages (Farsi, Dari, Pashto) into European languages and vice versa. Many of these languages, and a large number of other languages of the world, are considered low-resource languages (LRLs) because they lack linguistic resources, e.g. grammars, POS taggers, and corpora. For MT, the problem is further exacerbated by the lack of large amounts of quality parallel resources for training MT systems.
