Abstract

Script identification in handwritten document images is an open document analysis problem, especially for multilingual optical character recognition (OCR) systems. To design an OCR system for multi-script document pages, it is essential to identify the script of the text before running the OCR engine for that particular script. The present work reports an intelligent feature-based technique for word-level script identification in multi-script handwritten document pages. First, the text lines and then the words are extracted from the document pages. A set of 39 distinctive features is designed for each word image, of which 8 are topological and the remaining 31 are based on the convex hull. To select a suitable classifier, the performance of multiple classifiers is evaluated with the designed feature set on multiple subsets of the freely available database CMATERdb1.5.1 (http://www.code.google.com/p/cmaterdb), which comprises 150 handwritten document pages containing both Devnagari and Roman script words. Statistical significance tests on these performance measures identify the multilayer perceptron (MLP) as the best-performing classifier. The overall word-level script identification accuracy with the MLP classifier on this database is 99.74%.
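
To make the idea of convex-hull-based word features concrete, the following is a minimal sketch in Python. It is not the authors' feature set: the abstract does not list the 31 convex hull features, so the three measurements computed here (hull vertex count, hull area, hull perimeter) and the names `convex_hull` and `hull_features` are illustrative assumptions. The sketch also assumes that a word image is already segmented and binarised, with 1 marking an ink pixel.

```python
from math import hypot


def convex_hull(points):
    """Convex hull of 2-D points via Andrew's monotone chain.

    Returns the hull vertices in order around the boundary,
    with collinear points dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); the sign gives the turn direction
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hull_features(image):
    """Illustrative (hypothetical) hull features for a binary word image.

    `image` is a list of rows of 0/1 values, where 1 marks an ink pixel.
    Pixel centres are used as point coordinates.
    """
    ink = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    hull = convex_hull(ink)
    n = len(hull)
    edges = [(hull[i], hull[(i + 1) % n]) for i in range(n)]
    # Shoelace formula for the polygon area
    area = abs(sum(a[0] * b[1] - b[0] * a[1] for a, b in edges)) / 2.0
    perimeter = sum(hypot(b[0] - a[0], b[1] - a[1]) for a, b in edges)
    return {
        "ink_pixels": len(ink),
        "hull_vertices": n,
        "hull_area": area,
        "hull_perimeter": perimeter,
    }


# Toy 4x4 "word image": a hollow square of ink pixels
word = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
features = hull_features(word)
```

In a pipeline like the one described above, a feature vector of this kind would be computed for every extracted word image and passed to a classifier such as an MLP; which hull measurements actually discriminate Devnagari from Roman words is a matter for the full paper.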
