An Automatic and a Machine-assisted Method to Clean Bilingual Corpus

Jyoti Srivastava, Ashish Kumar Srivastava, Sudip Sanyal

doi:10.1145/3342351

Abstract

This article presents two methods for cleaning parallel corpora. The first is a machine-assisted technique suited to small parallel corpora; the second is an automatic method suited to large ones. Both are evaluated with a baseline SMT system (Moses). The machine-assisted technique uses two features, word alignment and the lengths of the source- and target-language sentences, to detect mistranslations in the corpus, which are then handled by a human translator. Experiments with this method are conducted on the English-to-Indian Language Machine Translation (EILMT) corpus (English-Hindi), where the Bilingual Evaluation Understudy (BLEU) score improves by 0.47% on the cleaned corpus. The automatic method combines two features: the lengths of the source- and target-language sentences, and the Viterbi alignment score generated by a Hidden Markov Model for each sentence pair. Each feature has its own threshold, and both thresholds are set using a small manually annotated parallel corpus of 206 sentence pairs. Experiments with this method are conducted on the HindEnCorp corpus, released at the workshop of the Association for Computational Linguistics (ACL 2014), where the BLEU score improves by 0.6% on the cleaned corpus. The two methods are also compared on the EILMT corpus.
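The two-threshold idea behind the automatic method can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact length feature, the score normalisation, or the threshold-selection procedure. The names `SentencePair`, `length_ratio`, `filter_corpus`, and `tune_thresholds` are hypothetical, the alignment score is assumed to be precomputed by an external HMM aligner, and a simple accuracy-maximising grid search on labelled pairs stands in for however the thresholds were actually chosen.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SentencePair:
    source: str
    target: str
    # Assumption: a precomputed alignment score from an external aligner
    # (for example a length-normalised Viterbi log-probability); higher is better.
    alignment_score: float


def length_ratio(pair: SentencePair) -> float:
    """Ratio of the longer to the shorter sentence, counted in whitespace tokens."""
    src_len = len(pair.source.split())
    tgt_len = len(pair.target.split())
    if src_len == 0 or tgt_len == 0:
        return float("inf")  # an empty side can never pass the length check
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def is_clean(pair: SentencePair, max_ratio: float, min_score: float) -> bool:
    """A pair is kept only if it passes both thresholds."""
    return length_ratio(pair) <= max_ratio and pair.alignment_score >= min_score


def filter_corpus(
    pairs: Sequence[SentencePair], max_ratio: float, min_score: float
) -> List[SentencePair]:
    """Return the pairs that pass both the length and the alignment check."""
    return [p for p in pairs if is_clean(p, max_ratio, min_score)]


def tune_thresholds(
    annotated: Sequence[Tuple[SentencePair, bool]],
    ratio_grid: Sequence[float],
    score_grid: Sequence[float],
) -> Tuple[float, float]:
    """Grid-search the (max_ratio, min_score) pair that best matches manual labels.

    `annotated` holds (pair, is_good_translation) tuples; ties keep the first
    candidate encountered.
    """
    if not annotated:
        raise ValueError("annotated set must not be empty")
    best, best_acc = (ratio_grid[0], score_grid[0]), -1.0
    for max_ratio in ratio_grid:
        for min_score in score_grid:
            correct = sum(
                is_clean(p, max_ratio, min_score) == label for p, label in annotated
            )
            acc = correct / len(annotated)
            if acc > best_acc:
                best, best_acc = (max_ratio, min_score), acc
    return best
```

In use, one would call `tune_thresholds` on the small annotated set and then pass the resulting pair of values to `filter_corpus` on the full corpus before training the SMT system.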

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Oct 9, 2019
Citations: 1

Similar Papers

Does BLEU Score Work for Code Migration?
Ngoc Tran ... Tien Nguyen
01 May 2019

Phrase-Based Named Entity Transliteration on Myanmar-English Terminology Dictionary
Aye Myat Mon ... Khin Mar Soe
05 Nov 2020

Phrase Table Combination Based on Symmetrization of Word Alignment for Low-Resource Languages
Sari Dewi Budiwati ... Al Hafiz Akbar Maulana Siagian
Applied Sciences | VOL. 11
20 Feb 2021

Spelling Correction of Non-Word Errors in Uyghur–Chinese Machine Translation
Rui Dong ... Yating Yang
Information | VOL. 10
06 Jun 2019
