Abstract

Building parallel resources for corpus-based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in machine translation research. In this paper, we propose an automatic approach for extracting parallel fragments from comparable corpora. The comparable corpora are collected from Wikipedia documents, and the approach exploits the multilingualism of Wikipedia. The automatic alignment of parallel text fragments uses a textual entailment technique and a Phrase-Based SMT (PBSMT) system. The extracted parallel text fragments are used as additional parallel translation examples to complement the training data for a PBSMT system. The additional training data extracted from comparable corpora provided significant improvements in translation quality over the baseline, as measured by BLEU.

Highlights

  • Comparable corpora have recently attracted huge interest in natural language processing research

  • Parallel text extracted from comparable corpora is typically added to the training corpus as additional training material, which is expected to improve the performance of Statistical Machine Translation (SMT) systems for low-density language pairs

  • Our work shows that even a small ad hoc corpus of Wikipedia articles can be beneficial for existing machine translation (MT) systems


Summary

Introduction

Comparable corpora have recently attracted huge interest in natural language processing research. They are considered a rich resource for acquiring parallel resources such as parallel corpora or parallel text fragments. Parallel text extracted from comparable corpora can play an important role in improving the quality of machine translation (MT) (Smith et al., 2010). Such text is typically added to the training corpus as additional training material, which is expected to improve the performance of SMT systems for low-density language pairs. We try to extract English-Bengali parallel fragments of text from comparable corpora. We have collected a document-aligned corpus of English-Bengali document pairs from Wikipedia, which provides a huge collection of documents in many different languages. For automatic alignment of parallel fragments, we have used a two-way textual entailment (TE) system and a baseline SMT system.

