Abstract

Text lines in ancient Tibetan books are often skewed and distorted, their strokes are broken, and adjacent text lines touch each other in complex ways, all of which makes text line segmentation extremely challenging. This paper proposes a text line segmentation method based on local baselines and connected component allocation. First, pseudotext lines are detected by analyzing the horizontal projection, straight line detection, and average character height, and the local baseline position is then determined within each pseudotext line area by the projection method. Second, adhesion areas are detected; these mainly comprise adhesion between characters and adhesion between characters and strokes. The positional relationship between connected components is used to detect adhesion between characters, and a convolutional neural network is used to detect adhesion between characters and strokes. Third, the watershed algorithm is used to segment the touching connected components. Finally, broken strokes are assigned to the text lines to which they belong according to the structural characteristics of Tibetan characters, and the assigned strokes are postprocessed to correct the stroke allocation, which completes the line segmentation. Experiments show that the method effectively reduces the influence of text line distortion and skew on segmentation, is highly robust, and achieves good segmentation accuracy on images of Tibetan documents with touching and broken strokes.
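The first step, finding candidate text line bands from a horizontal projection and then locating a baseline row inside each band, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ink threshold, and the choice of the densest row as the baseline are all assumptions made for illustration, and the sketch omits the straight line detection and character height analysis mentioned in the abstract.

```python
import numpy as np

def horizontal_projection(binary):
    """Count ink pixels in each row of a binary image (1 = ink, 0 = background)."""
    return binary.sum(axis=1)

def find_row_bands(profile, threshold=0):
    """Return (start, end) row ranges, end exclusive, where profile > threshold."""
    active = profile > threshold
    # Pad with False so bands touching the image border are closed properly.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

def local_baseline(binary, band):
    """Assumption: the row with the most ink inside the band approximates the baseline."""
    start, end = band
    profile = horizontal_projection(binary[start:end])
    return start + int(np.argmax(profile))

# Toy page with two synthetic "text lines" (hypothetical data, not from the paper).
page = np.zeros((12, 10), dtype=np.uint8)
page[1, :] = 1        # dense row of the first line
page[2:4, 2:5] = 1    # strokes below it
page[6, 1:9] = 1      # dense row of the second line
page[7:10, 3:6] = 1   # strokes below it

bands = find_row_bands(horizontal_projection(page))
print(bands)                                    # [(1, 4), (6, 10)]
print([local_baseline(page, b) for b in bands]) # [1, 6]
```

On real scans a global projection breaks down when lines are skewed or touching, which is why the method described above estimates baselines locally and then handles adhesion and broken strokes in separate steps.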
