Abstract

In this paper we present the development of an English-Bangla transliteration parallel corpus and use it to develop and evaluate several popular computational models for transliterating Romanized Bangla text, that is, Bangla written in English script, back to its original script. To this end, we developed different techniques to generate an English-Bangla parallel transliterated lexicon of around 100,000 words. This lexicon of English-Bangla transliterated word pairs, together with language-specific orthographic and phonetic rules, is used to develop two computational models, the joint source channel model and the phrase-based SMT model, which automatically identify, extract, and learn transliteration unit (TU) pairs from the source and target language words. Both models predict the top 5 candidate outputs for a given input text, and both were evaluated on a set of 20,000 Romanized Bangla test words. Our initial evaluation results show that the SMT model slightly outperforms the joint source channel model.
