Abstract

There are many governmental, cultural, commercial and educational organizations that manage large collections of manuscript textual information. Since managing information recorded on paper or in scanned documents is a hard and time-consuming task, Document Image Analysis (DIA) aims to extract the intended information as a human would (Nagy, 2000). The main subtasks of DIA (Mao et al., 2003) are: i) document layout analysis, which locates the “physical” components of the document such as columns, paragraphs, text lines, words, tables and figures; ii) document content analysis, which labels these components as titles, legends, footnotes, etc.; iii) optical character recognition (OCR); and iv) the reconstruction of the corresponding electronic document. The algorithms proposed for these processing stages come mainly from the fields of image processing, computer vision, machine learning and pattern recognition. Some of these algorithms are very effective in processing machine-printed document images and have therefore been incorporated into the workflows of well-known OCR systems. On the contrary, no comparably efficient systems have been developed for handling handwritten documents. The main reason is that the format of a handwritten manuscript and the writing style depend solely on the author's choices. For example, text lines in a machine-printed document can be assumed to share the same skew, whereas handwritten text lines may be curvilinear. Text line segmentation is a critical stage in layout analysis, upon which further tasks such as word segmentation, grouping of text lines into paragraphs, and characterization of text lines as titles, headings, footnotes, etc. may be built. For instance, a text-line segmentation task is part of the pipeline of the Handwritten Address Interpretation System (HWAIS), which takes a postal address image and determines a unique delivery point (Cohen et al., 1994). Another application in which text line extraction serves as a preprocessing step is the indexing of the George Washington papers at the Library of Congress (Manmatha & Rothfeder, 2005). A similar document analysis project, the Bovary Project, includes a text-line segmentation stage towards the transcription of the manuscripts of Gustave Flaubert (Nicolas et al., 2004a). In addition, many recent projects that focus on the digitisation of archives include activities for document image understanding in terms of automatic or semi-automatic extraction and indexing of metadata such as titles, subtitles and keywords (Antonacopoulos & Karatzas, 2004; Tomai et al., 2002). Clearly, these activities include text-line extraction.
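To make the role of the text-line segmentation stage concrete, the sketch below implements a classical horizontal projection-profile baseline in Python with NumPy. This is only an illustrative example, not the method discussed in this work; the function name segment_text_lines and the min_gap parameter are hypothetical choices made for this sketch. It assumes a binarized page where ink pixels are 1.

```python
# Minimal, illustrative sketch of a projection-profile baseline for
# text-line segmentation. NOT the method of this paper; it only shows
# the kind of step the abstract refers to. Assumes a binarized page
# image where foreground (ink) pixels are 1 and background pixels are 0.

import numpy as np

def segment_text_lines(binary_page: np.ndarray, min_gap: int = 3):
    """Return (top, bottom) row indices of candidate text lines.

    binary_page : 2-D array with ink pixels set to 1.
    min_gap     : smallest run of empty rows treated as an inter-line gap
                  (a hypothetical tuning parameter for this sketch).
    """
    # Count ink pixels in every image row (horizontal projection profile).
    profile = binary_page.sum(axis=1)

    lines, start, empty_run = [], None, 0
    for row, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = row              # a new candidate text line begins
            empty_run = 0
        elif start is not None:
            empty_run += 1
            if empty_run >= min_gap:     # gap wide enough: close the line
                lines.append((start, row - empty_run))
                start, empty_run = None, 0
    if start is not None:                # close a line reaching the page edge
        lines.append((start, len(profile) - 1))
    return lines

# Toy "page": two horizontal bands of ink separated by an empty gap.
page = np.zeros((20, 50), dtype=np.uint8)
page[2:6, 5:45] = 1
page[11:16, 5:45] = 1
print(segment_text_lines(page))          # -> [(2, 5), (11, 15)]
```

As the abstract points out, such a global projection presumes roughly horizontal text lines of uniform skew, which generally holds for machine-printed pages but breaks down for curvilinear handwritten lines; that limitation is precisely what dedicated handwritten text-line segmentation methods aim to overcome.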
