Kannada text line extraction based on energy minimization and skew correction

Sunanda Dixit, Suresh Hosahalli Narayan, Mahesh Belur

doi:10.1109/iadcc.2014.6779295

Abstract

Many governmental, cultural, commercial and educational organizations manage large volumes of manuscript text. Because Kannada is one of the official languages of South India, the holdings of such organizations include handwritten Kannada documents. Text line segmentation in such documents remains an open problem in document analysis, and detecting and correcting the skew angle of the segmented text lines is a further important step. Most segmentation algorithms for skewed text lines in the literature are sensitive to the degree of skew, the direction of skew, and the spacing between adjacent lines. This paper proposes a method for text line extraction and skew correction built on a new cost function that accounts for both the spacing between text lines and the skew of each text line. Specifically, the problem is formulated as an energy minimization problem, so that minimizing the cost function yields a set of text lines. The baseline skew and fluctuations of these extracted text lines must then be corrected efficiently, and the proposed method includes an efficient baseline correction algorithm for this purpose. The algorithm normalizes the lower baseline to a horizontal line using a sliding window approach, which avoids segmenting text lines into subparts. It copes with baselines that are skewed, fluctuating, or both, and, unlike machine learning approaches, it does not require manual assignment of pixels to baselines. Experimental results show that this baseline correction substantially improves performance.
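
The idea of normalizing the lower baseline with a window that moves across the text line can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how the baseline is estimated inside each window, so the sketch assumes a binary text-line image (ink = nonzero) and takes the median of the lowest ink pixels in the window as the local baseline. The names `bottom_ink_rows`, `sliding_baseline`, `correct_baseline` and the parameter `half_width` are hypothetical.

```python
import numpy as np


def bottom_ink_rows(img):
    """Row index of the lowest ink pixel in each column (NaN for empty columns)."""
    rows = np.full(img.shape[1], np.nan)
    for x in range(img.shape[1]):
        ys = np.flatnonzero(img[:, x])
        if ys.size:
            rows[x] = ys[-1]
    return rows


def sliding_baseline(img, half_width=2):
    """Assumed estimator: median of the bottom ink rows in a window around each column."""
    rows = bottom_ink_rows(img)
    baseline = np.full(rows.size, np.nan)
    for x in range(rows.size):
        window = rows[max(0, x - half_width): x + half_width + 1]
        valid = window[~np.isnan(window)]
        if valid.size:
            baseline[x] = np.median(valid)
    return baseline


def correct_baseline(img, half_width=2):
    """Shift every column down so the estimated lower baseline becomes horizontal."""
    h, w = img.shape
    baseline = sliding_baseline(img, half_width)
    ok = ~np.isnan(baseline)
    if not ok.any():
        return img.copy()  # no ink at all, nothing to correct
    xs = np.arange(w)
    # Columns whose window contained no ink borrow the baseline of their neighbours.
    baseline = np.rint(np.interp(xs, xs[ok], baseline[ok])).astype(int)
    shifts = baseline.max() - baseline  # non-negative downward shift per column
    out = np.zeros((h + shifts.max(), w), dtype=img.dtype)
    for x in range(w):
        out[shifts[x]: shifts[x] + h, x] = img[:, x]
    return out


# Toy example: a one-pixel-thick stroke that drifts downwards from left to right.
img = np.zeros((10, 8), dtype=np.uint8)
for x in range(8):
    img[2 + x // 2, x] = 1
flat = correct_baseline(img, half_width=1)
```

Because each column is shifted as a whole, the text line is never cut into subparts, which is the property the abstract emphasizes. A wider window makes the estimate less sensitive to descenders but also less responsive to fast fluctuations, so `half_width` would need tuning on real data.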

Full Text