Abstract

Determining the script type of a document image is a complex real-life problem for a multi-script country like India, where 23 official languages (including English) are written in 13 different scripts (including Roman). The problem becomes more challenging when handwritten documents are considered. This paper proposes an approach for identifying the script of handwritten document images written in any one of the Bangla, Devnagari, Roman and Urdu scripts. Two convolution-based techniques, namely the Gabor filter and morphological reconstruction, are combined to construct a 20-dimensional feature vector. Because no standard dataset is available, a corpus of 157 document images with an almost equal proportion of the four scripts was prepared. For classification, the dataset is divided in a 2:1 ratio. An average identification accuracy of 94.4% is obtained on the test set. The average bi-script and tri-script identification accuracies are 98.2% and 97.5%, respectively. Statistical performance analysis is carried out using several well-known classifiers.
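
To make the Gabor filter stage named in the abstract more concrete, the following is a minimal sketch of how orientation-sensitive texture features might be extracted from a grayscale document image. It is an illustration under stated assumptions, not the authors' implementation: the kernel size, wavelength, sigma, aspect ratio, number of orientations and the choice of mean and standard deviation as statistics are all hypothetical, the function names are invented for this example, and the morphological reconstruction stage is not included. The resulting vector has 8 entries, not the 20 dimensions reported in the paper.

```python
import numpy as np


def gabor_kernel(size=15, wavelength=8.0, theta=0.0, sigma=4.0, gamma=0.5):
    """Real part of a 2-D Gabor kernel; all parameter values are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate frame so the sinusoidal carrier runs along theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength)
    return envelope * carrier


def filter_image(image, kernel):
    """Circular convolution via the FFT (image must be at least kernel-sized)."""
    spectrum = np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape)
    return np.real(np.fft.ifft2(spectrum))


def gabor_features(image, n_orientations=4):
    """Mean and standard deviation of the response magnitude per orientation."""
    features = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        response = np.abs(filter_image(image, gabor_kernel(theta=theta)))
        features.extend([response.mean(), response.std()])
    return np.array(features)


if __name__ == "__main__":
    # Synthetic stand-in for a document image: vertical stripes of period 8.
    stripes = np.cos(2.0 * np.pi * np.arange(64) / 8.0)[None, :] * np.ones((64, 1))
    print(gabor_features(stripes))
```

On the synthetic stripe image, the filter whose carrier runs across the stripes responds far more strongly than the one aligned with them. This directional selectivity is what makes Gabor responses a plausible cue for telling apart scripts with different dominant stroke orientations.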
