An adaptive over-split and merge algorithm for page segmentation

Ha Dai-Ton, Nguyen Duc-Dung, Le Duc-Hieu

doi:10.1016/j.patrec.2016.06.011

Abstract

Page segmentation is a key step in building a document recognition system. Variation in character font sizes, narrow spacing between text blocks, and complicated page structure are the main causes of the most common errors, over-segmentation and under-segmentation. We propose an adaptive over-split and merge algorithm that reduces both types of error simultaneously. The document image is first over-split into text blocks, or even into individual text lines. These text blocks are then considered for merging into text regions using a new adaptive thresholding method. Local context analysis then uses a set of text line separators to split homogeneous text regions, made up of closely spaced text blocks of similar font size, into paragraphs. Experiments on the ICDAR2009 and UW-III benchmark datasets show that the proposed algorithm reduces both under-segmentation and over-segmentation errors and significantly improves performance compared with popular page segmentation algorithms.
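
The merge step can be illustrated with a small sketch. The abstract does not specify the adaptive thresholding rule, so the rule below is an assumption made only for illustration: a text line joins a region when the two overlap horizontally and the vertical gap is at most `k` times the smaller of the two line heights. The names `Box`, `merge_lines`, and `k` are hypothetical, and the over-splitting and local context analysis stages are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box of a text line, in pixel coordinates."""
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def height(self) -> int:
        return self.y1 - self.y0


def horizontal_overlap(a: Box, b: Box) -> int:
    """Width in pixels shared by the two boxes along the x axis."""
    return max(0, min(a.x1, b.x1) - max(a.x0, b.x0))


def merge_lines(lines: List[Box], k: float = 1.0) -> List[Box]:
    """Greedily merge over-split text lines into regions, top to bottom.

    Assumed rule (not taken from the paper): the gap threshold scales
    with the local font size, approximated by the smaller line height.
    """
    regions: List[Box] = []
    last_heights: List[int] = []  # height of the last line added to each region
    for line in sorted(lines, key=lambda b: b.y0):
        for i, region in enumerate(regions):
            gap = line.y0 - region.y1
            threshold = k * min(line.height, last_heights[i])
            if horizontal_overlap(region, line) > 0 and gap <= threshold:
                # Grow the region so that it also covers the new line.
                regions[i] = Box(min(region.x0, line.x0), min(region.y0, line.y0),
                                 max(region.x1, line.x1), max(region.y1, line.y1))
                last_heights[i] = line.height
                break
        else:
            # No region accepted the line, so it starts a new region.
            regions.append(Box(line.x0, line.y0, line.x1, line.y1))
            last_heights.append(line.height)
    return regions


lines = [
    Box(0, 0, 100, 10),    # body line
    Box(0, 14, 100, 24),   # body line, 4 px below the first
    Box(0, 60, 100, 90),   # larger-font line, 36 px below the body text
    Box(200, 0, 300, 10),  # line in a second column
]
regions = merge_lines(lines)
```

Because the threshold follows the line height, tightly spaced lines merge into one region, while the distant larger-font line and the line in the other column each stay separate. A single fixed pixel threshold would not adapt to font size in this way.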

Full Text