Abstract

Statistical machine translation is a highly popular approach in automatic translation research and a promising way to achieve good accuracy. This work develops an Urdu to Punjabi statistical machine translation system. The system trains its statistical model with an incremental training approach. Instead of a parallel sentence corpus, manually mapped phrases were used to train the model. In the preprocessing phase, various rules were used for the tokenization and segmentation processes. Along with these rules, a text classification system was implemented to assign the input text to predefined classes, and the decoder translates the text according to the domain selected by the text classifier. The system uses a Hidden Markov Model (HMM) for the learning process and the Viterbi algorithm for decoding. Experiments and evaluation show that a simple statistical model like the HMM yields good accuracy for a closely related language pair such as Urdu-Punjabi. The system achieved a BLEU score of 0.86 and more than 85% accuracy in manual testing.
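
As an illustration of the HMM decoding step named in the abstract, the following is a minimal Viterbi sketch, not the authors' implementation. It assumes that target-side phrases act as hidden states, source-side phrases act as observations, transition probabilities play the role of a bigram language model, and emission probabilities play the role of a phrase translation model. The phrase labels (`src_a`, `tgt_a1`, and so on), all probability values, and the smoothing floor are hypothetical placeholders.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable hidden-state sequence for the observations.

    `floor` is an assumed smoothing value for events missing from the tables.
    """
    if not observations:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[t][s] = (log-probability of the best path ending in s at step t,
    #               previous state on that path)
    best = [{
        s: (logp(start_p, s) + logp(emit_p, (s, observations[0])), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximizes path score plus transition.
            score, back = max(
                (prev[p][0] + logp(trans_p, (p, s)), p) for p in states
            )
            column[s] = (score + logp(emit_p, (s, obs)), back)
        best.append(column)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path

# Hypothetical toy model: two candidate target phrases for "src_a".
states = ["tgt_a1", "tgt_a2", "tgt_b1"]
start_p = {"tgt_a1": 0.5, "tgt_a2": 0.4, "tgt_b1": 0.1}
emit_p = {  # P(source phrase | target phrase)
    ("tgt_a1", "src_a"): 0.6,
    ("tgt_a2", "src_a"): 0.4,
    ("tgt_b1", "src_b"): 1.0,
}
trans_p = {  # P(next target phrase | current target phrase)
    ("tgt_a1", "tgt_b1"): 0.2,
    ("tgt_a2", "tgt_b1"): 0.9,
}

result = viterbi(["src_a", "src_b"], states, start_p, trans_p, emit_p)
print(result)
```

In this toy model, `tgt_a1` has the higher emission probability for `src_a`, but the path through `tgt_a2` wins because its transition into `tgt_b1` is much more likely (0.4 × 0.4 × 0.9 = 0.144 versus 0.5 × 0.6 × 0.2 = 0.06). Working in log space avoids numerical underflow on longer inputs.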

Highlights

  • Machine translation is a burning topic in the area of artificial intelligence

  • Many machine translation systems have been developed for Indo-Aryan languages [Garje G V, 2013]

  • Resource-poor languages: Urdu and Punjabi, like any other Indo-Aryan language, are new to the natural language processing area

Summary

INTRODUCTION

Machine translation is a burning topic in the area of artificial intelligence. In this digital era, communities across the world are connected to each other and share a vast amount of resources. In this kind of digital environment, the differences between natural languages are the main obstacle to communication. Various approaches have been developed to decode natural languages, including rule-based, example-based, statistical, and various hybrid approaches. Among these, the statistical approach is dominant and popular in the machine translation research community. Collecting parallel phrases was more convenient than collecting parallel sentences.

URDU AND PUNJABI: A CLOSELY RELATED LANGUAGE PAIR
Resource poor languages
Spelling variation
Free word order
Segmentation issues in Urdu
Morphologically rich languages
Words without diacritical marks
METHODOLOGY
Tokenization and segmentation process
Text Classification
Translation and Language model Training
EXPERIMENT AND EVALUATION
Findings
CONCLUSION