Abstract

A corpus is a large collection of homogeneous, authentic written texts (or speech) of a particular natural language that exists in machine-readable form. Corpora have a very broad range of uses in Computational Linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most NLP applications, especially Statistical Machine Translation (SMT). SMT is currently the most popular approach to Machine Translation (MT), and it can produce high-quality translations when given a large amount of aligned parallel text in the source and target languages. Although Bodo is a recognized language of India and a co-official language of Assam, very little machine-readable information is available in Bodo. Therefore, to expand the computerized information available in the language, an English to Bodo SMT system has been developed. This paper focuses mainly on building the English-Bodo parallel text corpora used to implement that system with the Phrase-Based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and constructed English-Bodo parallel text corpora for the General and Newspaper domains. Finally, the quality of the constructed corpora has been tested using two evaluation techniques in the SMT system.

Highlights

  • Machine translation is an important application in the field of Computational Linguistics and Natural Language Processing (NLP) whose aim is to translate texts from one natural language to another natural language in an automatic fashion

  • Although corpus construction is a very difficult and laborious task, the resulting corpora can benefit language education, language technology, linguistic research, and NLP tasks

  • The Statistical Machine Translation (SMT) system has been tested separately on each of the two domain corpora with varying numbers of parallel sentences, and the translation results differed accordingly


Summary

INTRODUCTION

Machine translation is an important application in the field of Computational Linguistics and NLP whose aim is to translate texts automatically from one natural language to another. MT is a very challenging research task in NLP, and demand for it is growing worldwide, especially in India. Many MT systems have been developed in India and elsewhere for pairs of major natural languages, such as English paired with Arabic, Bengali, Chinese, French, Hindi, Japanese, Spanish, or Urdu. Although a considerable amount of NLP work has already been done on various Indian languages, little has been done on MT for the Bodo language in particular, owing to the lack of a comprehensive set of parallel corpora. We therefore decided to construct English-Bodo Parallel Text Corpora (E-BPTC) for the General and Newspaper domains in order to develop an English to Bodo SMT system using the Phrase-Based SMT approach. The Bodo language, the English language, corpora, and the SMT approach are briefly discussed below.

Bodo Language
English Language
Corpus
Statistical Machine Translation
REVIEW OF RELATED CORPUS CONSTRUCTION
CONSTRUCTION OF E-BPTC
Newspaper Domain E-BPTC
IMPLEMENTATION OF ENGLISH TO BODO SMT
RESULT, EVALUATION, AND COMPARISON
CONCLUSION AND FUTURE RESEARCH