Abstract

As a shining pearl in traditional Tibetan culture, historical Tibetan documents have received extensive attention from historians, linguists and Buddhist scholars. These documents are converted into digital form using Tibetan document segmentation and recognition methods. Digitizing them is of great significance for the research, protection and inheritance of Tibetan history. This paper proposes an overall segmentation and recognition framework for historical Tibetan document images. First, the historical Tibetan document image is preprocessed to correct uneven illumination, skew and noise, and is then binarized. Second, we propose a layout segmentation method based on block projection that segments Tibetan document images into texts, lines and frames. Third, to handle touching strokes between text-lines and curvilinear text-lines, we present a text-line segmentation method based on a graph model. Lastly, we present a segmentation method for touching Tibetan character strings, after which the Tibetan characters are recognized. Experimental results show that the proposed methods for layout segmentation, text-line segmentation and touching character string segmentation achieve satisfactory performance. The proposed methods can also be applied to other fonts in the Tibetan font family.

Highlights

  • Tibetan is the first national script with international standards in China and one of the oldest scripts in the world

  • In order to extract texts accurately from historical Tibetan documents, we propose a layout segmentation method based on block projection

  • We propose a text-line segmentation method based on a graph model for historical Tibetan documents
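The block-projection idea named in the highlights can be illustrated with a minimal sketch. This is not the authors' method, whose details are not given here. It only shows the generic technique of splitting a binarized page into vertical blocks and computing a horizontal projection profile (ink count per row) for each block. The function names `block_projection` and `ink_runs`, the 0/1 pixel convention (1 = ink), the equal-width blocks and the threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of 0/1 pixels; 1 = ink (assumed convention)


def block_projection(binary: Image, n_blocks: int) -> List[List[int]]:
    """Return one horizontal projection profile (ink count per row) per vertical block."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    if n_blocks <= 0 or n_blocks > max(width, 1):
        raise ValueError("n_blocks must be between 1 and the image width")
    # Column boundaries that split the width into n_blocks nearly equal strips.
    bounds = [width * i // n_blocks for i in range(n_blocks + 1)]
    profiles = []
    for left, right in zip(bounds, bounds[1:]):
        # Count ink pixels of each row inside the current strip only.
        profiles.append([sum(row[left:right]) for row in binary])
    return profiles


def ink_runs(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges where the profile exceeds threshold."""
    runs = []
    start = None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y  # a run of inked rows begins
        elif value <= threshold and start is not None:
            runs.append((start, y))  # the run ended on the previous row
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs
```

Computing profiles per block rather than over the full page width means that a skewed or curved line still leaves clear gaps within each narrow strip, which is one plausible reason to prefer block-wise projection over a single global profile.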

Summary

INTRODUCTION

Tibetan is the first national script with international standards in China and one of the oldest scripts in the world. There are 158 letters of precious historical documents selected into the national precious ancient books. These documents have important cultural, historical and scientific value. Owing to the passage of time and human factors, a large number of historical Tibetan documents, with Tibetan paper as their carrier, have gradually become damaged and mildewed. The protection of these precious historical Tibetan documents is urgent. Research on the recognition of historical Tibetan document images is still in its infancy. We present a new touching Tibetan character string database built from historical Tibetan document images [8].

RELATED WORK
OVERALL SEGMENTATION AND RECOGNITION FRAMEWORK
DOCUMENT IMAGE PREPROCESSING
EXPERIMENTAL RESULTS
TEXT-LINE SEGMENTATION
TOUCHING CHARACTER SEGMENTATION
CONCLUSION