Abstract
This article presents two methods for cleaning parallel corpora. The first is a machine-assisted technique suited to small parallel corpora; the second is an automatic method suited to large parallel corpora. Both are evaluated with a baseline SMT system (MOSES). The machine-assisted technique uses two features, word alignment and the lengths of the source and target sentences, to detect mistranslations in the corpus, which are then handled by a human translator. Experiments with this method are conducted on the English-to-Indian Language Machine Translation (EILMT) corpus (English-Hindi), where the Bilingual Evaluation Understudy (BLEU) score improves by 0.47% on the cleaned corpus. The automatic method combines two features: the lengths of the source and target sentences, and the Viterbi alignment score generated by a Hidden Markov Model for each sentence pair. A separate threshold is used for each feature, and both thresholds are set using a small, manually annotated parallel corpus of 206 sentence pairs. Experiments with this method are conducted on the HindEnCorp corpus, released at a workshop of the Association for Computational Linguistics (ACL 2014), where the BLEU score improves by 0.6% on the cleaned corpus. A comparison of the two methods on the EILMT corpus is also presented.
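The two-feature filter used by the automatic method can be illustrated with a minimal sketch. It is not the authors' implementation. The names `ScoredPair`, `length_ratio`, `keep_pair` and `clean_corpus` are hypothetical, and so are the threshold values and the example scores. The sketch further assumes that the Viterbi alignment score is a length-normalised log-probability computed beforehand by an external HMM aligner, and that a pair is kept only when it passes both thresholds; the abstract does not state how the two tests are combined.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScoredPair:
    """A sentence pair with a precomputed alignment score (hypothetical type)."""
    source: str
    target: str
    # Assumption: length-normalised Viterbi log-probability from an HMM aligner.
    alignment_score: float


def length_ratio(source: str, target: str) -> float:
    """Ratio of the longer to the shorter sentence, in whitespace tokens."""
    s, t = len(source.split()), len(target.split())
    if s == 0 or t == 0:
        return float("inf")  # an empty side can never be a valid translation
    return max(s, t) / min(s, t)


def keep_pair(pair: ScoredPair,
              max_length_ratio: float = 2.0,        # illustrative threshold
              min_alignment_score: float = -5.0     # illustrative threshold
              ) -> bool:
    """Keep a pair only if it passes both the length and the alignment test."""
    return (length_ratio(pair.source, pair.target) <= max_length_ratio
            and pair.alignment_score >= min_alignment_score)


def clean_corpus(pairs: Iterable[ScoredPair],
                 max_length_ratio: float = 2.0,
                 min_alignment_score: float = -5.0) -> List[ScoredPair]:
    """Return the pairs that survive both thresholds, in their original order."""
    return [p for p in pairs
            if keep_pair(p, max_length_ratio, min_alignment_score)]


# Toy data: transliterated Hindi and made-up scores, for illustration only.
example_pairs = [
    ScoredPair("the house is small", "ghar chhota hai", -2.1),
    ScoredPair("yes", "yah ek bahut lamba vakya hai", -1.0),   # length mismatch
    ScoredPair("he went home", "vah ghar gaya", -9.5),         # poor alignment
]
cleaned = clean_corpus(example_pairs)
```

In practice the two thresholds would be tuned on a small annotated sample, as the abstract describes for its 206 sentence pairs, rather than fixed by hand as they are here.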
More From: ACM Transactions on Asian and Low-Resource Language Information Processing