Efficient character segmentation approach for machine-typed documents

Vladan Vučković,Boban Arizanović

doi:10.1016/j.eswa.2017.03.027

Abstract

In this paper an efficient approach for segmentation of the individual characters from scanned documents typed on old typewriters is proposed. The approach proposed in this paper is primarily intended for processing of machine-typed documents, but can be used for machine-printed documents as well. The proposed character segmentation approach uses the modified projection profiles technique which is based on using the sliding window for obtaining the information about the document image structure. This is followed by histogram processing in order to determine the spaces between lines, words and characters in the document image. The decision-making logic used in the process of character segmentation is describes and represents the most an integral aspect of the proposed technique. Beside the character segmentation approach, the ultra-fast architecture for geometrical image transformations, which is used for image rotation in the process of skew correction, is presented, and its fast implementation using pointer arithmetic and a highly optimized low-level machine routine is provided. The proposed character segmentation approach is semi-automatic and uses threshold values to control the segmentation process. Provided results for segmentation accuracy show that the proposed approach outperforms the state-of-the-art approaches in most cases. Also, the results from the aspect of the time complexity show that the new technique performs faster than state-of-the-art approaches and can process even very large document images in less than one second, which makes this approach suitable for real-time tasks. Finally, visual demonstration of the proposed approach performances is achieved using original documents authored by Nikola Tesla.

Full Text