Employing CNN to Identify Noisy Documents Thereafter Accomplishing Text Line Segmentation

Bhagesh Seraogi, Bidyut B Chaudhuri, Rahul Roy, Supriya Das, Srinivas Mukkamala, Purnendu Banerjee, Himadri Majumder

doi:10.1109/tencon.2018.8650333

Abstract

A high volume of noise in text documents makes it very difficult to achieve high accuracy in text line segmentation, so approaches that depend on the line segmentation stage also suffer. In addition, because noise patterns vary, the text content gets distorted, which sometimes breaks the strokes of a character. Character-level recognition therefore becomes a challenging task for noisy documents. Our proposed method aims to address these challenges. Since the input image may be noisy, a convolutional neural network (CNN) architecture is first used to identify whether the input image contains noise. We treat this as a two-class problem in which the classes are noisy and clean. We then apply a two-stage algorithmic approach based on the proposed piecewise projection profile feature and adaptive region growing: it first identifies the upper and lower boundaries of each text line in a noisy text document image and then tries to regroup the broken strokes of a character to improve character recognition accuracy. The proposed method has been tested on a large dataset of noisy documents, and several quality measures are computed to establish its effectiveness. These measures show that the text line segmentation accuracy is comparable to similar state-of-the-art results.
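To make the idea of a piecewise projection profile concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binarized image stored as a list of rows (1 for ink, 0 for background), splits it into equal-width vertical strips, and counts the ink pixels per row inside each strip. The function names `piecewise_projection_profiles` and `line_bands`, the strip count, and the `min_ink` threshold are all hypothetical choices made for illustration; the adaptive region growing stage and the CNN noise classifier are not covered here.

```python
def piecewise_projection_profiles(image, num_strips):
    """Row-wise ink counts for each vertical strip of a binary image.

    image: list of equal-length rows, 1 = ink pixel, 0 = background.
    Returns num_strips profiles, each holding one count per image row.
    (Hypothetical helper, assumed for illustration.)
    """
    if not image or not image[0]:
        return []
    width = len(image[0])
    # Strip boundaries; integer arithmetic spreads any remainder across strips.
    bounds = [width * i // num_strips for i in range(num_strips + 1)]
    return [
        [sum(row[lo:hi]) for row in image]
        for lo, hi in zip(bounds, bounds[1:])
    ]


def line_bands(profile, min_ink=1):
    """Return (top, bottom) row pairs, bottom inclusive, for each run of
    consecutive rows whose ink count is at least min_ink.
    (min_ink is an assumed threshold, not a value from the paper.)
    """
    bands = []
    start = None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                      # a candidate text band begins
        elif count < min_ink and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the last row
        bands.append((start, len(profile) - 1))
    return bands


# Toy image: the first text line drifts downward from the left half to the
# right half, as a skewed line would; the second line is level.
image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
profiles = piecewise_projection_profiles(image, num_strips=2)
bands_per_strip = [line_bands(p) for p in profiles]
```

On this toy image the left strip yields bands at rows 1 to 2 and row 5, while the right strip yields rows 2 to 3 and row 5, so the per-strip boundaries follow the drifting line. A single profile over the full width would merge rows 1 to 3 into one band, which is one reason a piecewise profile can be preferable for skewed or noisy lines.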

Full Text