Abstract

Although parallel corpora are essential language resources for many natural language processing tasks, they are scarce or entirely unavailable for many language pairs. Comparable corpora, in contrast, are widely available and contain parallel fragments that can be used in applications such as statistical machine translation. In this research, we propose a generative model based on latent Dirichlet allocation for extracting parallel fragments from comparable documents without using any initial parallel data or bilingual lexicon. Experimental results show a significant improvement when the fragments extracted by the proposed method are used to augment an existing parallel corpus in a statistical machine translation system. According to human judgment, the accuracy of the proposed method on an English-Persian task is about 59.7%. For the same task, the out-of-vocabulary error rate is reduced by 28%.
