Abstract

India is a multilingual, multi-script country whose states follow a three-language formula, so a single document may be printed in English, Hindi, and the official language of the state. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Tamil, Hindi, and English text lines in a printed document. The proposed system distinguishes the three scripts using features extracted from the horizontal projection profile of each text line. The knowledge base of the system was developed from 20 different document images containing about 600 text lines. The system was tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 100% was achieved.
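
As an illustration of the horizontal projection profile mentioned above, the following is a minimal sketch in Python with NumPy. It assumes each text line is already segmented and binarized, with ink pixels stored as nonzero values. The function names and the two features shown (relative position and strength of the strongest row) are hypothetical choices made for this example; the abstract does not state which features or decision rules the proposed system actually uses.

```python
import numpy as np


def horizontal_projection_profile(line_img):
    """Count ink pixels in each row of a binarized text-line image.

    Assumption: ink pixels are nonzero, background pixels are zero.
    """
    line_img = np.asarray(line_img)
    return (line_img > 0).sum(axis=1)


def profile_features(line_img):
    """Illustrative features only; not the feature set of the paper."""
    line_img = np.asarray(line_img)
    profile = horizontal_projection_profile(line_img)
    height, width = line_img.shape
    peak_row = int(np.argmax(profile))  # row with the most ink
    return {
        "peak_row": peak_row,
        # 0.0 = top of the line, 1.0 = bottom of the line
        "peak_position": peak_row / max(height - 1, 1),
        # fraction of the line width covered by ink in the peak row
        "peak_ratio": float(profile[peak_row]) / width,
    }


if __name__ == "__main__":
    # Synthetic 10x20 "text line": one full-width stroke near the top
    # plus a short vertical stroke below it.
    img = np.zeros((10, 20), dtype=np.uint8)
    img[2, :] = 1
    img[3:8, 5] = 1
    print(horizontal_projection_profile(img))
    print(profile_features(img))
```

One plausible reason such a profile can separate scripts is that Devanagari text, used for Hindi, carries a continuous headline along the top of most words, which tends to produce a single dominant peak in the upper part of the profile. How Tamil and English lines are told apart in the proposed system is not described in the abstract, so any thresholds on these features would have to be derived from training data.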
