Abstract

India is a multilingual country with 22 official languages and more than 1600 languages in existence. Kannada, one of the official languages, is a South Indian language widely used in the state of Karnataka, whose population is over 65 million, and it ranks 33rd among the most widely spoken languages in the world. However, the survey reveals that much more effort is required to develop a complete Optical Character Recognition (OCR) system. In this direction, the present work develops a methodology toward that goal. The overall accuracy of an OCR system depends largely on the accuracy of the segmentation phase, so a robust and efficient segmentation method is desirable. This paper proposes a method for segmenting the text properly in order to improve OCR performance at the later stages. In the proposed method, segmentation is performed using the horizontal projection profile and windowing. The result is passed to the recognition module, which uses the Histogram of Oriented Gradient (HoG) in combination with a support vector machine (SVM). The recognition result is then fed back to the segmentation module to improve accuracy. The experiments delivered promising results.

Highlights

  • Optical character recognition (OCR) refers to the process of transforming images of handwritten or printed documents into a machine-readable, editable format

  • Step 3: Obtaining the Histogram of Oriented Gradient (HoG) descriptor: to nullify the effects of illumination and shading, the cell histograms obtained in step 2 need to be normalized

  • The normalized cell histogram values are represented in the form of a vector, and this vector is called the HoG descriptor
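
The normalization step described in the highlights above can be illustrated with a minimal sketch. It assumes the common block-based scheme in which neighbouring cell histograms are grouped, concatenated, and L2-normalized; the 2x2 block size, the stride of one cell, the `eps` constant, and the function names `l2_normalize` and `hog_descriptor` are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List


def l2_normalize(values: List[float], eps: float = 1e-6) -> List[float]:
    """Scale a vector to (approximately) unit L2 length.

    eps is an assumed small constant that avoids division by zero
    for blocks that contain no gradient energy.
    """
    norm = math.sqrt(sum(v * v for v in values) + eps * eps)
    return [v / norm for v in values]


def hog_descriptor(cell_histograms: List[List[List[float]]]) -> List[float]:
    """Build a descriptor vector from a grid of cell histograms.

    cell_histograms[r][c] is the orientation histogram of the cell in
    row r, column c. Assumed layout: 2x2 blocks of cells, moved one
    cell at a time; each block is concatenated and L2-normalized.
    """
    rows = len(cell_histograms)
    cols = len(cell_histograms[0])
    descriptor: List[float] = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block: List[float] = []
            for dr in (0, 1):
                for dc in (0, 1):
                    block.extend(cell_histograms[r + dr][c + dc])
            descriptor.extend(l2_normalize(block))
    return descriptor


# A 2x2 grid of cells with two orientation bins each (toy values).
cells = [
    [[3.0, 0.0], [0.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
# The same cells under ten times stronger contrast.
brighter = [[[10.0 * v for v in hist] for hist in row] for row in cells]

d1 = hog_descriptor(cells)
d2 = hog_descriptor(brighter)
```

Because each block is divided by its own norm, `d1` and `d2` come out almost identical, which is the sense in which normalization cancels a uniform change in illumination.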


Summary

INTRODUCTION

Optical character recognition (OCR) refers to the process of transforming images of handwritten or printed documents into a machine-readable, editable format. All OCR systems have the following stages: image preprocessing, segmentation, feature extraction, and character recognition. In the segmentation of document images, we first extract the lines, the words, and the characters. Segmentation of characters from a document is still an open challenge in the area of developing efficient OCR systems. Because of the large dataset and structural complexity, the development of OCR for some Indian languages, such as Kannada and Telugu, is considered a tedious task [1]. Adding to these complexities, in some cases the characters may overlap with each other.
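
As an illustration of the line-extraction step, the following is a minimal sketch of segmentation by horizontal projection profile. It assumes a binarized page stored as a list of rows in which 1 marks an ink pixel and 0 marks background; the function names and the `min_ink` threshold are hypothetical, and the windowing and feedback stages of the proposed method are not modelled here.

```python
from typing import List, Tuple


def horizontal_projection(image: List[List[int]]) -> List[int]:
    """Count the ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A text line is taken to be a maximal run of consecutive rows whose
    ink count is at least min_ink (an assumed threshold).
    """
    profile = horizontal_projection(image)
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                    # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))     # a blank row closes the line
            start = None
    if start is not None:                # line that touches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

The same idea applied to column sums within each extracted line (a vertical projection profile) is a common way to separate words and characters, although overlapping characters, as noted above, can defeat a purely projection-based split.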

LITERATURE
PROPOSED METHOD
The Recognition Module
EXPERIMENTS AND RESULTS
Method
CONCLUSION