A robust method for line and word segmentation in handwritten text

Abdelaali Hassaine

doi:10.5339/qfarf.2013.ictp-057

Abstract

Line and word segmentation is a key step in any document image analysis system. It is used, for instance, in handwriting recognition to separate words before they are recognized. Line segmentation can also serve as a preliminary step before extracting the geometric characteristics of lines, which are unique to each writer. Text line and word segmentation is not an easy task because of the following problems: 1) text lines do not all have the same direction in handwritten text; 2) text lines are not always horizontal, which makes their separation more difficult; 3) characters may overlap between successive text lines; 4) inter-word and intra-word distances are often hard to tell apart. In our method, line segmentation is done using a smoothed version of the handwritten document, which makes it possible to detect the main line components with a subsequent thresholding algorithm. Each connected component of the resulting image is then assigned a separate label, which represents a line component. Then, each text region that intersects exactly one line component is assigned the label of that line component. The Voronoi diagram of the image thus obtained is then computed in order to label the remaining text pixels. Word segmentation starts by computing a generalized Chamfer distance in which the horizontal distance is slightly favored. This distance is subsequently smoothed so that it reflects the distances between word components and neglects the distances to dots and diacritics. Word segmentation is then performed by thresholding the distance thus obtained. The threshold depends on the characteristics of the handwriting. We have therefore computed several features in order to predict it, including the sum of maximum distances within each line component, the number of connected components within the document, and the average width and height of lines.
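The distance-and-threshold idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the step weights (2 horizontal, 3 vertical, 4 diagonal), and the threshold value are all assumptions chosen for illustration, and making horizontal steps cheaper is only one possible reading of "slightly favored". The smoothing step mentioned in the abstract is omitted.

```python
def chamfer_distance(binary, wh=2, wv=3, wd=4):
    """Two-pass chamfer distance transform on a 2D list of 0/1 (1 = ink).

    wh, wv, wd are the costs of a horizontal, vertical, and diagonal step.
    The default values are illustrative assumptions, not taken from the paper.
    """
    inf = float("inf")
    rows, cols = len(binary), len(binary[0])
    dist = [[0 if binary[r][c] else inf for c in range(cols)]
            for r in range(rows)]

    # Forward pass: propagate distances from the top-left corner.
    for r in range(rows):
        for c in range(cols):
            d = dist[r][c]
            if c > 0:
                d = min(d, dist[r][c - 1] + wh)
            if r > 0:
                d = min(d, dist[r - 1][c] + wv)
                if c > 0:
                    d = min(d, dist[r - 1][c - 1] + wd)
                if c < cols - 1:
                    d = min(d, dist[r - 1][c + 1] + wd)
            dist[r][c] = d

    # Backward pass: propagate distances from the bottom-right corner.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            d = dist[r][c]
            if c < cols - 1:
                d = min(d, dist[r][c + 1] + wh)
            if r < rows - 1:
                d = min(d, dist[r + 1][c] + wv)
                if c < cols - 1:
                    d = min(d, dist[r + 1][c + 1] + wd)
                if c > 0:
                    d = min(d, dist[r + 1][c - 1] + wd)
            dist[r][c] = d
    return dist


def word_gap_mask(dist, threshold):
    """Mark pixels whose distance to the nearest ink exceeds the threshold."""
    return [[d > threshold for d in row] for row in dist]


# Toy one-row "text line": a narrow gap (inside a word) and a wide gap
# (between two words). The threshold of 3 is a hypothetical value.
line = [[1, 1, 0, 1, 0, 0, 0, 0, 1, 1]]
distances = chamfer_distance(line)
gaps = word_gap_mask(distances, threshold=3)
```

In this toy example only the middle of the wide gap exceeds the threshold, so the narrow gap is treated as an intra-word space while the wide one separates two words.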
The optimal threshold is then obtained by training a linear regression on those features using a training set of about 100 documents. This method achieved the best performance on the ICFHR Handwriting Segmentation Contest dataset, reaching a matching score of 97.4% on line segmentation and 91% on word segmentation. The method has also been tested on the QUWI Arabic dataset, reaching 97.1% on line segmentation and 49.6% on word segmentation. The relatively low performance of word segmentation on Arabic script is due to the fact that words are much closer to each other than in English script. The proposed method tackles most of the problems of line and word segmentation and achieves high segmentation results. It could, however, be improved by combining it with a handwriting recognizer, which would eliminate words that are not recognized.
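The threshold prediction step, a linear regression from document features to a segmentation threshold, might look roughly like the sketch below. It uses ordinary least squares solved through the normal equations in plain Python. The function names, the two example features, and all numbers are synthetic and hypothetical; they are not data or results from the paper.

```python
def fit_linear(features, targets):
    """Ordinary least squares: solve (X^T X) w = X^T y with an intercept term."""
    X = [[1.0] + list(row) for row in features]  # prepend the intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]  # [intercept, w1, w2, ...]


def predict(weights, row):
    """Predicted threshold for one document's feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Synthetic training set (hypothetical): each row is
# (sum of maximum distances, number of connected components).
rows = [(10, 50), (12, 80), (8, 40), (15, 120), (9, 70)]
# Synthetic "optimal" thresholds generated from a known linear rule,
# so the fit can be checked against it.
targets = [1.0 + 0.5 * a + 0.02 * b for a, b in rows]

weights = fit_linear(rows, targets)
new_threshold = predict(weights, (11, 60))
```

Because the synthetic targets are exactly linear in the features, the fitted weights recover the generating rule up to floating-point error. With real documents the fit would only approximate the best threshold for each page.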
