Abstract

Binarization plays a central role in document image processing. It is usually performed in the preprocessing stage and is important for downstream tasks such as optical character recognition (OCR). Segmenting text from badly degraded document images is challenging because of the high inter- and intra-image variation between the document background and the foreground text. This paper presents a method for segmenting the foreground text from the background. First, a high-contrast image is constructed, which requires a rough estimate of the background. A hybrid thresholding algorithm that combines global and local thresholding is then applied. The global thresholding step is modified so that its output is not a binarized image but an intermediate gray-level image, which is helpful because most of the background is eliminated at this stage. Local thresholding is then applied to the result of the global thresholding step. The method is simple, robust, and effective. It performs better than most existing local and global thresholding algorithms and can deal with degradations caused by stains, ink bleed-through, low contrast, watermarks, dust, smear, and uneven illumination. The method has been tested on three public datasets used in the recent Document Image Binarization Contests (DIBCO) 2009 and 2011 and handwritten-DIBCO 2011, and it achieves results that are significantly higher than or close to those of the best-performing methods reported in the three contests. To further demonstrate its performance relative to other techniques, experiments have also been performed on the more challenging Bickley diary dataset.
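
The two-stage thresholding idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual thresholding rules, so everything here is an assumption: Otsu's method stands in for the global step, a local-mean rule stands in for the local step, and the function names and the `radius` and `bias` parameters are hypothetical. The background estimation and contrast construction stage is omitted because the abstract gives no detail about it.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit gray levels, 0 = black, 255 = white


def otsu_threshold(img: Image) -> int:
    """Gray level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with gray level <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def global_step(img: Image, white: int = 255) -> Image:
    """Assumed global stage: pixels brighter than the Otsu threshold become
    white; darker pixels KEEP their gray level, so the result is an
    intermediate gray-level image rather than a binary one."""
    t = otsu_threshold(img)
    return [[p if p <= t else white for p in row] for row in img]


def local_step(gray: Image, radius: int = 7, bias: int = 15,
               white: int = 255) -> Image:
    """Assumed local stage: a retained pixel is text if it is darker than
    the mean of its (2*radius+1)-wide neighborhood minus `bias`.
    The window should be wider than the stroke width."""
    h, w = len(gray), len(gray[0])
    out = [[white] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if gray[y][x] == white:
                continue  # already removed as background by the global stage
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            if gray[y][x] < mean - bias:
                out[y][x] = 0
    return out


def binarize(img: Image) -> Image:
    """Hybrid pipeline: global stage first, then local stage on its output."""
    return local_step(global_step(img))
```

Calling `binarize` on a list of rows of 8-bit gray levels returns an image of the same shape containing only 0 (text) and 255 (background). The design point the sketch tries to capture is that the local stage only has to decide among the pixels the global stage left non-white.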
