Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Pit Schneider

doi:10.46298/jdmdh.7277

Abstract

Text line segmentation is one of the preprocessing stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed to be applied to a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or the presence of noise. For that reason, the segmenter in question could be of particular interest to cultural institutions that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results, which come at a low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the open source OCR software in use.

Highlights

While adapting modern open source OCR software to reprocess a large collection of historic newspapers, ranging from 1841 to 1954, the National Library of Luxembourg (BnL) explored ways to segment the newspaper scans into individual text lines.
In addition to running faster than BENCH, the aim was for the method to take up only a fraction of the time needed for the OCR pipeline's character recognition functionality.
The final but essential step of COMBISEG is the analysis of horizontal histogram projections, one created for every bounding box stored in boxes.
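The histogram step mentioned in the last highlight can be illustrated with a minimal sketch. It is not the COMBISEG implementation: the function name `split_box_by_projection`, the `(top, bottom, left, right)` box layout, the ink-equals-1 convention, and the `min_ink` threshold are all assumptions made for illustration. The idea is simply to count ink pixels per row inside one bounding box and to cut the box wherever the count drops below the threshold.

```python
import numpy as np


def split_box_by_projection(binary, box, min_ink=1):
    """Split one bounding box into line boxes using a horizontal projection.

    binary  : 2D array with 1 for ink and 0 for background (assumed convention).
    box     : (top, bottom, left, right), bottom and right exclusive (assumed layout).
    min_ink : rows with fewer ink pixels are treated as gaps (hypothetical threshold).
    """
    top, bottom, left, right = box
    region = binary[top:bottom, left:right]

    # Horizontal histogram projection: number of ink pixels in every row.
    profile = region.sum(axis=1)
    is_text = profile >= min_ink

    lines = []
    start = None
    for y, flag in enumerate(is_text):
        if flag and start is None:
            start = y  # a run of text rows begins
        elif not flag and start is not None:
            lines.append((top + start, top + y, left, right))
            start = None  # the run ended at a gap row
    if start is not None:
        # Text reaches the bottom edge of the box.
        lines.append((top + start, bottom, left, right))
    return lines


# Toy page with two blocks of ink separated by empty rows.
page = np.zeros((10, 8), dtype=np.uint8)
page[1:3, 1:7] = 1
page[5:8, 2:6] = 1
line_boxes = split_box_by_projection(page, (0, 10, 0, 8))
```

On real scans a threshold of a single pixel would be too sensitive to noise; a smoothed profile or a threshold relative to the box width would be a more plausible choice, but the source does not state which variant is used.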

Summary

CONTEXT

While adapting modern open source OCR software to reprocess a large collection of historic newspapers, ranging from 1841 to 1954, the National Library of Luxembourg (BnL) explored ways to segment the newspaper scans into individual text lines. Pursuing this goal, a method was developed that integrates into a larger OCR pipeline by sitting between the binarization and font recognition processes. The open source OCR software in use already bundles a segmentation algorithm, in the form of the kraken.pageseg.segment function, which relies mostly on different filters from scipy.ndimage (Virtanen et al. [2020]). This algorithm, as published with version 2.0.8 and referred to in the following as BENCH, essentially served as a benchmark during the development of a custom solution designed for the precise needs of BnL. The subsequent section evaluates the method and draws the comparison to BENCH.

ALGORITHM

Input Assumption

Segmentation

Morphology

Components

Histogram

Output

Parameters

Value Determination

EXPERIMENTAL RESULTS

Evaluation Method

Postprocessing

Results

CONCLUSION

Journal: Journal of Data Mining & Digital Humanities
Publication Date: Nov 4, 2021
License type: CC BY 4.0
