Abstract

This paper presents a textline detection method for degraded historical documents. Our method follows a conventional two-step procedure: binarization is performed first, and the textlines are then extracted from the binary image. To address the challenges of historical documents, such as document degradation, structure noise, and skew, we develop new methods for both binarization and textline extraction. First, we improve binarization performance by detecting non-text regions and processing only the text regions. We also improve textline detection by extracting the main textblock and compensating for the skew angle and writing style. Experimental results show that the proposed method achieves state-of-the-art performance on several datasets.

Highlights

  • Historical documents are valuable cultural heritage and there are increasing demands to digitize them for archiving, indexing, and recognition purposes

  • Textline detection is an essential step in many document processing tasks, and numerous methods have been proposed over the decades

  • When large values are selected for these parameters, some textblocks can be classified as metadata; Fig. 5e shows the proposed result

  • 5.3 Connected components (CCs) grouping: after extracting the CCs in the main textblocks, we find textlines in the textblocks by using the method in [25], which addresses the textline detection problem by partitioning the extracted CCs into subsets corresponding to textlines


Summary

Introduction

Historical documents are a valuable part of cultural heritage, and there is increasing demand to digitize them for archiving, indexing, and recognition. They suffer from various kinds of degradation, and understanding them remains a challenging problem. We present a textline detection algorithm for historical documents, which is a key step toward document understanding.

1.1 Textline detection in historical documents

Textline detection is an essential step in many document processing tasks (e.g., layout analysis and optical character recognition), and numerous methods have been proposed over the decades. Binarization is a challenging task due to degradations (e.g., bleed-through and faint characters) and structure noise. In addition to the difficulties in binarization, historical documents pose a variety of further challenges because they are mostly handwritten [25].
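To make the conventional two-step procedure concrete (binarization first, then textline extraction from the binary image), the following is a minimal sketch in plain Python. It is not the method proposed in the paper: the fixed global threshold, the 4-connected component labeling, the greedy vertical-overlap grouping, and all function names (binarize, connected_components, group_into_textlines) are assumptions chosen only for illustration. The paper's non-text region detection, main textblock extraction, skew compensation, and the CC partitioning method of [25] are not reproduced here.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Step 1 (assumed global threshold): dark pixels become 1 (ink), others 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right), inclusive,
    of the 4-connected foreground regions of a binary image."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def group_into_textlines(boxes):
    """Step 2 (assumed greedy rule): a CC joins the first line whose
    vertical extent it overlaps; otherwise it starts a new line."""
    lines = []
    for box in sorted(boxes):  # process CCs from top to bottom
        top, _, bottom, _ = box
        for line in lines:
            if top <= line["bottom"] and bottom >= line["top"]:
                line["boxes"].append(box)
                line["top"] = min(line["top"], top)
                line["bottom"] = max(line["bottom"], bottom)
                break
        else:
            lines.append({"top": top, "bottom": bottom, "boxes": [box]})
    # Order the CCs of each line from left to right.
    return [sorted(line["boxes"], key=lambda b: b[1]) for line in lines]

# Toy grayscale page: two rows of dark blobs on a white background.
page = [
    [255, 255, 255, 255, 255, 255, 255, 255],
    [255,  10,  10, 255,  20,  20, 255, 255],
    [255,  10,  10, 255,  20,  20, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
    [255,  30, 255, 255,  40,  40, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
]
textlines = group_into_textlines(connected_components(binarize(page)))
```

A grouping rule this simple would likely fail on skewed or touching lines and on pages with structure noise, which are the kinds of difficulties the introduction attributes to historical documents.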

