Word-wise script identification based on morphological reconstruction in printed bilingual documents

B.V. Dhandra, H. Mallikarjun, V.S. Malemath, R. Hegadi

doi:10.1049/cp:20060562

Abstract

Owing to the diversity of languages and scripts in India, English has become the binding language. As a result, a line of a bilingual document page may contain words in a regional language alongside numerals in English. For optical character recognition (OCR) of such a page, the script of each word must be identified before the script-specific OCR is run. This paper proposes an automatic technique for word-level script identification based on morphological reconstruction, for two kinds of printed bilingual documents, Telugu and Devnagari, each containing English numerals as the common script. The technique comprises a feature extractor and classifiers. The feature extractor has two stages. In the first stage, morphological erosion and opening by reconstruction are carried out on the document image in the horizontal and vertical directions using a line structuring element, whose length is fixed from the average height of all connected components in the image. In the second stage, the average pixel distribution is computed for the resulting images. The nearest neighbor and k-nearest neighbor algorithms are then used to classify new word images. The proposed algorithm was tested on 1500 sample words in various font styles and sizes. The results obtained are quite encouraging.
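
The two-stage feature extractor and the classifier described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`components`, `erode_line`, `open_by_reconstruction`, `features`, `knn_predict`) are hypothetical, word images are assumed to be binary lists of lists with 1 as foreground, 8-connectivity is assumed, "average pixel distribution" is interpreted as the fraction of foreground pixels in each resulting image, and Euclidean distance is assumed for the classifier.

```python
import math
from collections import Counter, deque


def components(img):
    """Return the 8-connected foreground components as lists of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(comp)
    return comps


def erode_line(img, length, horizontal):
    """Erode with a one-pixel-wide line anchored at its left/top end.

    Anchoring at the end instead of the centre only shifts the result;
    the number of surviving pixels is unchanged.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            cells = [(r, c + i) if horizontal else (r + i, c)
                     for i in range(length)]
            if all(y < h and x < w and img[y][x] for y, x in cells):
                out[r][c] = 1
    return out


def open_by_reconstruction(img, length, horizontal):
    """Keep every component of img that contains at least one eroded pixel.

    For binary images this equals reconstruction by dilation of the eroded
    marker under the original image.
    """
    marker = erode_line(img, length, horizontal)
    out = [[0] * len(img[0]) for _ in img]
    for comp in components(img):
        if any(marker[y][x] for y, x in comp):
            for y, x in comp:
                out[y][x] = 1
    return out


def features(img):
    """Four densities: (erosion, opening by reconstruction) x (horiz., vert.)."""
    comps = components(img)
    if not comps:
        return [0.0] * 4
    heights = [max(y for y, _ in c) - min(y for y, _ in c) + 1 for c in comps]
    # Line length fixed from the average component height (rounded, assumed).
    length = max(1, round(sum(heights) / len(heights)))
    area = len(img) * len(img[0])
    feats = []
    for horizontal in (True, False):
        for op in (erode_line, open_by_reconstruction):
            out = op(img, length, horizontal)
            feats.append(sum(map(sum, out)) / area)
    return feats


def knn_predict(train, query, k=1):
    """Majority vote among the k nearest (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

With `k=1` the last function reduces to the nearest neighbor rule. In use, `features` would be computed for each labelled training word, and `knn_predict` would then be called with the feature vector of an unseen word image.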
