A novel method of text line segmentation for historical document image of the uchen Tibetan

Zhenjiang Li,Weilan Wang,Yang Chen,Yusheng Hao

doi:10.1016/j.jvcir.2019.01.021

Abstract

Text line segmentation is a key step in Tibetan historical document recognition. A novel method for text line segmentation was proposed based on the baseline in uchen Tibetan, and a new dataset was released, which was used to evaluate the results of text line segmentation of uchen Tibetan historical documents. In this paper, there were two steps for the proposed method: baseline detection and text line segmentation using the baseline. In baseline detection, the upper edges of all characters in the document were obtained by a horizontal gradient operator, then an edge connectivity definition was proposed by which the upper edge set was divided into disjoint subsets. Eligible sets were selected from these subsets, and the edges in these sets were joined in turn to obtain the baseline. In text line segmentation, the document image was truncated at the baseline position, then the adhesion regions were segmented again. Each connected region in the image was assigned to its nearest baseline. All connected regions belonging to the same baseline formed a text line. Experiments on the proposed dataset showed that the method could effectively avoid document distortion, the accuracy of text line segmentation was high, and the text line adhesion could be handled.

Full Text