Separation of Machine Printed Roman and Gurmukhi Script Words

Dharamveer Sharma

doi:10.1007/978-3-642-12214-9_84

Abstract

In a multi-lingual country like India, a document may contain more than one script forms. For such a document, it is necessary to separate different script forms before feeding them to OCRs of respective scripts. In the work presented in this paper, a successful attempt has been made to identify the script at the word level in a bilingual document containing Roman and Gurmukhi scripts. The technique presented here can separate English and Punjabi words present in a single document. In this approach English and Punjabi words are separated using certain features of Gurmukhi and Roman script. Words with various font styles and sizes have been used for the testing of the proposed algorithms and the results are quite encouraging. The system has an overall accuracy of 98.78% of identification.

Full Text