Abstract

In a multilingual country like India, a single document may contain more than one script. Such a document must be split by script before its parts are passed to the OCR engine for each script. This paper presents a technique that identifies the script at the word level in a bilingual document containing Roman and Gurmukhi scripts, so that the English and Punjabi words in a single document can be separated. The separation relies on certain features of the Gurmukhi and Roman scripts. The proposed algorithms were tested on words in various font styles and sizes, and the system achieved an overall identification accuracy of 98.78%.

