Document Page Images Research Articles

Segmenting text lines and words in an unconstrained handwritten document or a degraded machine-printed document is a challenging document analysis problem because of the heterogeneity of document structure. Lines are often unevenly skewed, and words are often broken. This article contributes a method for segmenting a document page image into lines and words. We propose an unsupervised, robust, and simple statistical method for segmenting a document image that is either handwritten or machine-printed (degraded or otherwise). The method treats segmentation as a two-class classification problem, in which the classification is based on the distribution of gap sizes (between lines and between words) in a binary page image. The method is easy to implement: apart from binarization of the input image, no pre-processing is necessary, and it does not require high computational resources. It is unsupervised in the sense that no annotated document page images are needed, so the question of a training database does not arise. Given a document page image, the parameters needed for segmenting text lines and words are learned in an unsupervised manner. We applied the method to several popular publicly available handwritten and machine-printed datasets (ISIDDI, IAM-Hist, IAM, PBOK) covering different Indian and other languages and different fonts. Several experimental results are presented to show its effectiveness and robustness. On the ICDAR-2013 handwriting segmentation contest dataset, our method outperforms the winning method. In addition, we suggest a quantitative measure of the level of degradation of a document page image.
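The gap-size idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a single binarized text line (1 = ink), measures the runs of empty columns, and splits those gaps into two classes (within-word and between-word) with a simple one-dimensional two-means threshold. The function names `gap_runs`, `two_class_threshold`, and `segment_words` are hypothetical.

```python
from typing import List, Tuple

Line = List[List[int]]  # binary image of one text line, 1 = ink, 0 = background


def gap_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, length) for each interior run of empty columns.

    The left margin (a run starting at column 0) and the right margin
    (a run reaching the last column) are not reported.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    return [r for r in runs if r[0] != 0]


def two_class_threshold(sizes: List[int]) -> float:
    """Split gap sizes into 'small' and 'large' with 1-D two-means.

    Assumption: this stands in for whatever statistical rule the paper uses.
    If all gaps have the same size, no gap exceeds the returned threshold.
    """
    lo, hi = float(min(sizes)), float(max(sizes))
    if lo == hi:
        return hi
    for _ in range(100):
        t = (lo + hi) / 2
        small = [s for s in sizes if s <= t]
        large = [s for s in sizes if s > t]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def segment_words(line: Line) -> List[Tuple[int, int]]:
    """Return half-open column ranges (start, end) of the words in a line."""
    width = len(line[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in line) for x in range(width)]
    ink_cols = [x for x, v in enumerate(profile) if v > 0]
    if not ink_cols:
        return []
    first, last = ink_cols[0], ink_cols[-1]
    gaps = gap_runs(profile)
    if not gaps:
        return [(first, last + 1)]
    threshold = two_class_threshold([size for _, size in gaps])
    words = []
    start = first
    for pos, size in gaps:
        if size > threshold:  # large gap: treated as a word boundary
            words.append((start, pos))
            start = pos + size
    words.append((start, last + 1))
    return words
```

The same procedure could in principle be applied to row gaps in a horizontal projection profile to separate lines, although skewed or touching lines would need more care than this sketch provides.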

Word searching, or keyword spotting, is an important research problem in document image processing, and it is more challenging for handwritten documents than for printed ones. This work introduces a two-stage word searching scheme. In the first stage, all words that are irrelevant to the search word are filtered out of the document page image, using a zonal feature vector, called the pre-selection feature vector, together with a rule-based binary classification method. In the second stage, a holistic word recognition paradigm is used to confirm whether a pre-selected word is the search word. To accomplish this, a modified histogram of oriented gradients-based feature descriptor is combined with a topological feature vector. The method is evaluated on the QUWI English database, which is freely available through the International Conference on Document Analysis and Recognition 2015 competition entitled “Writer Identification and Gender Classification.” The technique not only provides good retrieval performance in terms of recall, precision, and F-measure, but also outperforms some state-of-the-art methods.
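The two-stage structure described in this abstract can be sketched as a cheap filter followed by a more expensive confirmation. The sketch below is only an illustration under stated assumptions: the zonal feature is a plain per-strip ink density, the rule is a fixed tolerance on each zone, and the second-stage descriptor is supplied by the caller rather than being the paper's modified histogram of oriented gradients with topological features. All names (`zonal_density`, `preselect`, `two_stage_search`, `tol`, `max_dist`) are hypothetical.

```python
from typing import Callable, List, Sequence

Image = List[List[int]]  # binary word image, 1 = ink, 0 = background


def zonal_density(img: Image, zones: int = 4) -> List[float]:
    """Fraction of ink pixels in each of `zones` vertical strips."""
    height, width = len(img), len(img[0])
    feats = []
    for z in range(zones):
        x0, x1 = z * width // zones, (z + 1) * width // zones
        ink = sum(img[y][x] for y in range(height) for x in range(x0, x1))
        area = max(1, height * (x1 - x0))  # avoid division by zero for narrow images
        feats.append(ink / area)
    return feats


def preselect(query: Image, candidate: Image, tol: float = 0.25) -> bool:
    """Stage 1 (assumed rule): keep a candidate only if every zone density
    is within `tol` of the query's."""
    fq, fc = zonal_density(query), zonal_density(candidate)
    return all(abs(a - b) <= tol for a, b in zip(fq, fc))


def two_stage_search(
    query: Image,
    candidates: Sequence[Image],
    descriptor: Callable[[Image], List[float]],
    max_dist: float,
    tol: float = 0.25,
) -> List[int]:
    """Return indices of candidates that pass both stages."""
    dq = descriptor(query)
    hits = []
    for i, cand in enumerate(candidates):
        if not preselect(query, cand, tol):
            continue  # filtered out cheaply, descriptor never computed
        dc = descriptor(cand)
        # Stage 2: Euclidean distance between holistic descriptors.
        dist = sum((a - b) ** 2 for a, b in zip(dq, dc)) ** 0.5
        if dist <= max_dist:
            hits.append(i)
    return hits
```

The point of the design is that the expensive descriptor is computed only for candidates that survive the first stage, so most irrelevant words cost just one zonal feature comparison.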

Articles published on Document Page Images

An Unsupervised and Robust Line and Word Segmentation Method for Handwritten and Degraded Printed Document

Script Identification for Printed and Handwritten Indian Documents: An Empirical Study of Different Feature Classifier Combinations

Automatic Abstraction of Combinational Logic Circuit from Scanned Document Page Images

Development of a Two-Stage Segmentation-Based Word Searching Method for Handwritten Document Images

A segmentation-free word spotting method for historical printed documents

A Unified Algorithm for Identification of Various Tabular Structures from Document Images

Page segmentation for document image analysis using a neural network
