Abstract

In today’s globalized scenario, the demand for translation is high and growing rapidly across many fields, and it is impractical to translate everything manually. Machine Translation (MT), which depends on the availability of corpora, is a means of meeting this demand. Most translation knowledge is acquired from parallel corpora, so their quantity and quality are critical. Because parallel corpora are not readily available for many language pairs, comparable corpora, which are widely accessible, can be used to extract parallel data. A systematic literature survey is performed on 188 research articles published in premier journals, conferences, workshops and book chapters, and the review process is guided by a set of research questions. Different MT systems and their features are identified, and several datasets and techniques for bilingual lexicon extraction and for parallel sentence and fragment extraction are examined. A proposed architecture and a mind map are also presented in this review article to give a clearer picture of parallel data extraction from comparable corpora. The survey aims to deepen readers' understanding of parallel data mining through bilingual lexicons, parallel sentences and fragments.

Highlights

  • In today’s era of globalization, a lot of data is accessible on the internet in diverse languages and domains

  • Several other MT systems exist, such as Neural Machine Translation, but this review focuses only on the MT systems mentioned in the 188 papers covered by the literature survey

  • Of the 188 papers, 34% are works reviewed under the term “Machine Translation”, whereas 9% fall under “Statistical Machine Translation”



Introduction

In today’s era of globalization, a lot of data is accessible on the internet in diverse languages and domains. A corpus is a large collection of text used to analyze how words, phrases and language are used; it is employed by linguists, social scientists, natural language processing experts and others. A parallel corpus comprises two corpora in different languages, where one is the translation of the other. Languages are so vast and complex that it is practically impossible to write translation rules manually in a reasonable amount of time; to address this, the emphasis shifted to statistical analysis. Maskara and Bhattacharyya (2018) focused on recent developments in parallel sentence mining from comparable corpora (CC) using techniques such as word embeddings, deep learning and machine translation systems. This study discusses three approaches to creating a parallel corpus, i.e., the Sentence Alignment approach, the Web Mining approach and the Manual approach.
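To illustrate the kind of embedding-based parallel sentence mining mentioned above, the following is a minimal sketch rather than any specific system from the survey. It assumes a hypothetical embed() function that maps sentences from either language into a shared bilingual embedding space, and it pairs sentences from two comparable documents whose embeddings exceed a cosine-similarity threshold.

```python
import numpy as np

# Hypothetical placeholder: in practice this would be a bilingual sentence
# encoder (e.g. word embeddings mapped into a shared cross-lingual space).
# Random vectors are used here only so the sketch runs end to end; they will
# not produce meaningful matches.
def embed(sentence: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    v = rng.normal(size=300)
    return v / np.linalg.norm(v)

def mine_parallel_sentences(src_sents, tgt_sents, threshold=0.8):
    """Greedily pair sentences from two comparable documents whose
    embedding cosine similarity is at least `threshold`."""
    src_vecs = np.stack([embed(s) for s in src_sents])
    tgt_vecs = np.stack([embed(t) for t in tgt_sents])
    sim = src_vecs @ tgt_vecs.T          # cosine similarity (unit vectors)
    pairs, used_tgt = [], set()
    for i, row in enumerate(sim):
        j = int(np.argmax(row))
        if row[j] >= threshold and j not in used_tgt:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
            used_tgt.add(j)
    return pairs

# Toy comparable documents (contents are illustrative only).
english_doc = ["The parliament approved the budget.", "The weather was mild this week."]
hindi_doc = ["संसद ने बजट को मंजूरी दी।", "इस सप्ताह मौसम सुहावना रहा।"]
print(mine_parallel_sentences(english_doc, hindi_doc))
```

With a real bilingual encoder in place of the placeholder, the same thresholded similarity search is the core step behind many of the parallel sentence extraction techniques surveyed in the article.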

