A Survey of Orthographic Information in Machine Translation

Bharathi Raja Chakravarthi, John P McCrae, Priya Rani, Mihael Arcan

doi:10.1007/s42979-021-00723-4

Open Access

https://doi.org/10.1007/s42979-021-00723-4

Abstract

Machine translation is one of the applications of natural language processing that has been explored in many languages. Recently, researchers have started paying attention to machine translation for resource-poor languages and closely related languages. A widespread underlying problem for these machine translation systems is linguistic difference and variation in orthographic conventions, which causes many issues for traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve a machine translation system. This article offers a survey of research regarding orthography's influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and explains how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from them. Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey. This article ends with a discussion of the way forward in machine translation with orthographic information, focusing on multilingual settings and bilingual lexicon induction.

Highlights

Natural language processing (NLP) plays a significant role in keeping languages alive and in the development of languages in the digital device era [1].
The main goal of this survey is to shed light on how orthographic information is utilised in machine translation (MT) system development and how orthography helps to overcome the data sparsity problem for under-resourced languages.
The authors studied how to exploit the similar syntactic and semantic structures of closely related languages from the Dravidian language family by phonetically transcribing the corpora into Latin script and adding image features to improve translation quality [127]. They showed that orthographic information improves translation quality in multilingual neural machine translation (NMT) [128].
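
As a rough illustration of how orthographic information can be used once two languages share a script, the sketch below scores word pairs by normalised edit distance and keeps the high-scoring pairs as cognate candidates. It is not the method of any work cited here: the function names, the 0.7 threshold, and the choice of normalised Levenshtein distance as the similarity measure are all assumptions made for this example, and it presupposes that both word lists have already been transcribed into the same script.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, lower for distant ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cognate_candidates(
    source_words: List[str],
    target_words: List[str],
    threshold: float = 0.7,  # hypothetical cut-off; would need tuning per language pair
) -> List[Tuple[str, str, float]]:
    """Return (source, target, score) triples at or above the threshold, best first."""
    pairs = []
    for source in source_words:
        for target in target_words:
            score = orthographic_similarity(source, target)
            if score >= threshold:
                pairs.append((source, target, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    # Spanish/Portuguese words chosen only for illustration.
    for source, target, score in cognate_candidates(
        ["noche", "nacional"], ["noite", "nacional", "casa"], threshold=0.5
    ):
        print(source, target, round(score, 2))
```

Character-level similarity of this kind only makes sense within a single script, which is one reason transcription into a common script is a natural preprocessing step for closely related languages written in different orthographies.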

Summary

Introduction

Natural language processing (NLP) plays a significant role in keeping languages alive and in the development of languages in the digital device era [1]. [...] large amounts of high-quality parallel resources or linguists to make a vast set of rules. This survey studies how to take advantage of orthographic information and closely related languages to improve the translation quality of under-resourced languages. The main goal of this survey is to shed light on how orthographic information is utilised in MT system development and how orthography helps to overcome the data sparsity problem for under-resourced languages. Moreover, it tries to explain how orthography interacts with different types of machine translation. This survey ends with a discussion of future directions for utilising orthographic information.

SN Computer Science (2021) 2:330

Background

Discussion

Conclusion