Line, Word, and Character Segmentation from Bangla Handwritten Text—A Precursor Toward Bangla HOCR

Payel Rakshit, Subhankar Ghosh, Kaushik Roy, Chayan Halder

doi:10.1007/978-981-10-8180-4_7

Abstract

The basic function of optical character recognition (OCR) is to recognize text in document images and convert it into digitally editable text. Beyond this, OCR has other potential applications in document image processing, such as automatic document sorting and writer identification/verification. At present, commercially available OCR systems exist mostly for Roman script. Development of an unconstrained offline handwritten character recognition system is one of the most challenging tasks for the research community. The task becomes more complicated for Indic scripts such as Bangla, which contains more than 280 modified and compound characters in addition to isolated characters. For recognition of handwritten documents, the most convenient approach is to segment the text into characters or character parts. Line-, word-, and character-level segmentation therefore plays a vital role in the development of such a system. In this paper, a scheme for tri-level segmentation (line, word, and character) is presented. Encouraging segmentation results are achieved on a set of 50 handwritten text documents.

Full Text