Aligning Sentences in English-Bengali Corpora

Raihan Ahmed, Mohammad Reza Selim, Mehedi Al Hasan

doi:10.1109/ic4me2.2018.8465608

Abstract

Parallel corpora are an important resource for many areas of Natural Language Processing (NLP) research. For applications such as Cross-Language Information Retrieval and Statistical Machine Translation, parallel corpora aligned at the sentence level are more efficient and useful than unaligned ones. Although many sources of bilingual corpora exist, they do not come in sentence-aligned form, so developing an efficient method for aligning the sentences in such corpora is an important step for NLP research. NLP researchers have invested much effort in this problem, and several methods have proved effective for different language pairs. To the best of our knowledge, no previous work has addressed sentence alignment in English-Bengali parallel corpora, a gap that has held back NLP research for this language pair. Our goal was therefore to develop an efficient method for aligning sentences in English-Bengali parallel corpora. We evaluated the performance of several existing methods on this language pair and chose the best one as the basis for our work. We then extended the selected method to exploit lexical information about the language pair in order to attain better results.

Full Text