Incorporated preprocessing and physical layout analysis of a binary document image using a two stage classification

Hamed Behin, Afshin Ebrahimi, Sepideh Ebrahimi

doi:10.1109/iccce.2010.5556766

Abstract

Before the image of a document enters an OCR module, it should undergo preprocessing and document layout analysis. Document layout analysis usually comes after preprocessing. Noise removal and skew correction are two major preprocessing operations. Document layout analysis itself is divided into physical and logical layout analysis. Physical layout analysis decomposes the image of a document into homogeneous regions such as "text", "graphics", and "lines": the image is first segmented into homogeneous regions, and each region is then classified into one of the available classes. Logical layout analysis, on the other hand, assigns functional labels (such as "title", "author", and "footnote") to some of the classified regions, finds relationships between regions, and discovers the reading order of the different parts of a document. This article presents an innovative method for preprocessing and physical layout analysis of binary documents. Although most existing systems pass the result of preprocessing to document layout analysis, state-of-the-art algorithms try to postpone the processing operations as much as possible in order to prevent irreparable mistakes. Our approach combines these two steps by using the segmentation results for noise removal. One reason for the effectiveness of this approach is the appropriate ordering of its procedures. In addition, a neural classifier is trained so that its output is robust to skew. A two-stage classification is used to determine pixel classes. In the first stage, the Haar wavelet transform is computed on the resized, gray-level image. The coefficients are normalized, and 10% of them are then selected at random. The selected coefficients are clustered into 4 groups using k-means. A novel algorithm is introduced for assigning these 4 clusters to a background class, a vertical class, and two horizontal classes. The remaining wavelet coefficients are assigned to one of these classes by the KNN algorithm. The results of this stage, together with other features, help an MLP network perform the second-stage classification. In addition to the regular classes, an ambiguous class is introduced to capture regions that result from erroneous segmentation. Using the statistics of connected component sizes and the horizontal projection profile, these regions are re-segmented and reclassified by another neural network. The presented approach is designed for textual documents with horizontally extended text, and it is applicable to manuscripts with vertically extended text after minor changes. Besides having fair computational complexity and robustness to skew, the proposed method has produced satisfactory results on different types of databases, such as magazines, books, newspapers, and official letters.
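
The first classification stage described in the abstract (Haar transform, normalization, random 10% sample, k-means with 4 clusters, KNN for the remaining coefficients) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions that the abstract does not specify: a single decomposition level, a feature vector made of the absolute horizontal, vertical, and diagonal detail coefficients at each position, per-feature max normalization, and k = 5 neighbours. All function names are hypothetical. The paper's algorithm for mapping the 4 clusters to the background, vertical, and horizontal classes is not described in the abstract and is therefore left out; the sketch returns raw cluster indices.

```python
import numpy as np


def haar_level1(img):
    """Single-level 2D Haar transform: (approx, horiz, vert, diag) subbands."""
    img = img[: img.shape[0] // 2 * 2, : img.shape[1] // 2 * 2].astype(float)
    a, b = img[0::2, 0::2], img[0::2, 1::2]  # top row of each 2x2 block
    c, d = img[1::2, 0::2], img[1::2, 1::2]  # bottom row of each 2x2 block
    approx = (a + b + c + d) / 2.0
    horiz = (a + b - c - d) / 2.0  # row differences (horizontal edges)
    vert = (a - b + c - d) / 2.0   # column differences (vertical edges)
    diag = (a - b - c + d) / 2.0
    return approx, horiz, vert, diag


def coefficient_features(img):
    """Assumed features: normalized |horiz|, |vert|, |diag| per position."""
    _, h, v, d = haar_level1(img)
    feats = np.stack([np.abs(h), np.abs(v), np.abs(d)], axis=-1).reshape(-1, 3)
    scale = feats.max(axis=0)
    scale[scale == 0] = 1.0  # avoid division by zero for empty subbands
    return feats / scale, h.shape


def nearest_center(x, centers):
    """Index of the closest center for every row of x."""
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def kmeans(x, k, rng, iters=50):
    """Plain Lloyd's algorithm; an empty cluster keeps its previous center."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = nearest_center(x, centers)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_center(x, centers)


def knn_predict(train_x, train_y, query_x, k, n_classes, chunk=4096):
    """Majority vote among the k nearest training points, in chunks."""
    out = np.empty(len(query_x), dtype=int)
    for s in range(0, len(query_x), chunk):
        q = query_x[s:s + chunk]
        dist = ((q[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
        votes = train_y[np.argsort(dist, axis=1)[:, :k]]
        counts = np.stack(
            [(votes == c).sum(axis=1) for c in range(n_classes)], axis=1
        )
        out[s:s + chunk] = counts.argmax(axis=1)
    return out


def first_stage(img, sample_frac=0.10, n_clusters=4, k=5, seed=0):
    """Cluster a random sample of coefficients, then label the rest by KNN."""
    rng = np.random.default_rng(seed)
    feats, shape = coefficient_features(img)
    n = len(feats)
    n_sample = max(n_clusters, int(round(sample_frac * n)))
    order = rng.permutation(n)
    sample, rest = order[:n_sample], order[n_sample:]
    _, sample_labels = kmeans(feats[sample], n_clusters, rng)
    labels = np.empty(n, dtype=int)
    labels[sample] = sample_labels
    if len(rest):
        labels[rest] = knn_predict(
            feats[sample], sample_labels, feats[rest], k, n_clusters
        )
    return labels.reshape(shape)  # raw cluster indices, one per position
```

Clustering only a 10% sample keeps the k-means step cheap, and KNN then propagates the cluster labels to the remaining coefficients. Calling `first_stage(img)` on a 2D array returns a label map at half the input resolution in each dimension.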

Full Text