Abstract

Owing to the diversity of languages and scripts in India, English has become the binding language. As a result, a single line of a bilingual document page may contain words in a regional language alongside numerals in English. For optical character recognition (OCR) of such a page, the script of each word must be identified before the corresponding script-specific OCR is run. This paper proposes an automatic technique for word-level script identification, based on morphological reconstruction, for two kinds of printed bilingual documents, Telugu and Devnagari, each containing English numerals as the common script. The technique comprises a feature extractor and classifiers. The feature extractor has two stages. In the first stage, morphological erosion and opening by reconstruction are applied to the document image in the horizontal and vertical directions using a line structuring element, whose length is fixed based on the average height of all connected components in the image. In the second stage, the average pixel distribution is computed in the resulting images. The nearest neighbor and k-nearest neighbor algorithms are then used to classify new word images. The proposed algorithm was tested on 1500 sample words in various font styles and sizes. The results obtained are quite encouraging.
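The pipeline described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions because the abstract does not specify them: word images are binary lists of lists (1 = ink), "average pixel distribution" is taken to mean the fraction of foreground pixels in each resulting image, reconstruction uses 8-connectivity, the classifier uses Euclidean distance, and all function names are hypothetical. The structuring element length is left as a parameter, since the abstract only says it is derived from the average connected-component height.

```python
from collections import Counter, deque
import math

def erode_line(img, length, horizontal=True):
    """Binary erosion with a one-pixel-thick line; outside the image counts as background."""
    h, w = len(img), len(img[0])
    lo, hi = length // 2, length - 1 - length // 2  # extent on each side of the origin
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            cells = [(r, c + k) if horizontal else (r + k, c) for k in range(-lo, hi + 1)]
            out[r][c] = int(all(0 <= y < h and 0 <= x < w and img[y][x] for y, x in cells))
    return out

def reconstruct(marker, mask):
    """Binary reconstruction by dilation: keep mask components touched by the marker."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    queue = deque((r, c) for r in range(h) for c in range(w) if marker[r][c] and mask[r][c])
    for r, c in queue:
        out[r][c] = 1
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):          # 8-connectivity (assumption)
            for dc in (-1, 0, 1):
                y, x = r + dr, c + dc
                if 0 <= y < h and 0 <= x < w and mask[y][x] and not out[y][x]:
                    out[y][x] = 1
                    queue.append((y, x))
    return out

def average_component_height(img):
    """Mean bounding-box height over all connected components (0.0 for a blank image)."""
    h, w = len(img), len(img[0])
    seen = [[0] * w for _ in range(h)]
    heights = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seed = [[0] * w for _ in range(h)]
                seed[r][c] = 1
                comp = reconstruct(seed, img)   # grow the component from one pixel
                rows = [i for i in range(h) if any(comp[i])]
                heights.append(rows[-1] - rows[0] + 1)
                for i in range(h):
                    for j in range(w):
                        seen[i][j] |= comp[i][j]
    return sum(heights) / len(heights) if heights else 0.0

def density(img):
    """Fraction of foreground pixels (our reading of 'average pixel distribution')."""
    return sum(map(sum, img)) / (len(img) * len(img[0]))

def extract_features(img, length):
    """Four features: erosion and opening-by-reconstruction densities, horizontal then vertical."""
    feats = []
    for horizontal in (True, False):
        eroded = erode_line(img, length, horizontal)
        opened = reconstruct(eroded, img)       # opening by reconstruction
        feats += [density(eroded), density(opened)]
    return feats

def knn_predict(train, query, k=1):
    """Majority vote among the k nearest training vectors; k=1 is plain nearest neighbor."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In use, one would build `train` as a list of `(extract_features(word, length), script_label)` pairs from labeled word images and then call `knn_predict(train, extract_features(new_word, length), k)`. One possible choice is `length = max(1, int(average_component_height(page)))`, but the exact rule relating the length to the average height is not given in the abstract.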
