Abstract
More than 1000 languages and 14 scripts are used by 112 million people in India. Printed documents in any of these scripts can be divided into three kinds of blocks: text blocks, image blocks, and table blocks. In the 21st century there is, for obvious reasons, a need to convert these old printed documents into digital form. Converting them manually is a huge and difficult task, and it is prone to human error. The automated alternative is to use an optical character recognition (OCR) system to convert the entire printed document image into an editable document. In this paper, an effort is made to develop such an OCR technique. First, a scanned document is preprocessed for noise removal and skew correction. This is followed by text/non-text classification, after which text line detection is performed in the text area. No available method can detect text lines when the image contains a multicolumn text area. The main contribution of this paper is to detect the blocks and then detect the text lines within these detected blocks; a technique that extracts the text lines from a document image is presented. After the text lines are extracted, word segmentation, character segmentation, and template matching can be performed.
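To illustrate the idea of detecting blocks first and text lines second, the following is a minimal sketch using projection profiles on a binarized page. It is not the method proposed in this paper, whose details are not given in the abstract; projection profiles are used here only as a common, simple stand-in. The function names (`runs`, `detect_columns`, `detect_lines`), the `min_gap` thresholds, and the tiny synthetic page are all assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized page: 1 = ink, 0 = background


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where profile > 0.

    Ranges separated by fewer than min_gap empty positions are merged,
    so that small gaps (e.g. between characters) do not split a block.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for span in spans:
        if merged and span[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span[1])
        else:
            merged.append(span)
    return merged


def detect_columns(image: Image, min_gap: int = 3) -> List[Tuple[int, int]]:
    """Find column blocks from the vertical projection (ink count per x)."""
    width = len(image[0])
    col_profile = [sum(row[x] for row in image) for x in range(width)]
    return runs(col_profile, min_gap)


def detect_lines(image: Image, min_col_gap: int = 3) -> List[Tuple[int, int, int, int]]:
    """Return (x0, x1, y0, y1) boxes: text lines found inside each column block."""
    boxes = []
    for x0, x1 in detect_columns(image, min_col_gap):
        # Horizontal projection restricted to this block only.
        row_profile = [sum(row[x0:x1]) for row in image]
        for y0, y1 in runs(row_profile):
            boxes.append((x0, x1, y0, y1))
    return boxes


def make_page() -> Image:
    """Synthetic 7x12 two-column page whose lines are vertically offset."""
    page = [[0] * 12 for _ in range(7)]
    for x0, x1, y0, y1 in [(0, 4, 0, 2), (0, 4, 3, 5), (8, 12, 1, 3), (8, 12, 5, 7)]:
        for y in range(y0, y1):
            for x in range(x0, x1):
                page[y][x] = 1
    return page


if __name__ == "__main__":
    for box in detect_lines(make_page()):
        print(box)
```

On this synthetic page, a horizontal projection over the full page width finds a single unbroken run of rows, because the lines of the two columns are offset and fill each other's gaps. Restricting the projection to each detected block recovers two lines per column. Real scans would additionally need the noise removal and skew correction mentioned above before such profiles become reliable.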