Abstract
Over the past decade, through a mixture of optical character recognition and manual input, there is now a growing corpus of Tibetan literature available as e-texts in Unicode format. With the creation of such a corpus, the techniques of text analytics that have been applied in the analysis of English and other modern languages may now be applied to Tibetan. In this work, we narrow our focus to examine a modest portion of that literature, the Mind-section portion of the literature of the Tibetan tradition of the Great Perfection. Here, we will use the lens of text analytics tools based on machine learning techniques to investigate a number of questions of interest to scholars of this and related traditions of the Great Perfection. It has been necessary for us to participate in all portions of this process: corpora identification and text edition selection, rendering the text as e-texts in Unicode using both Optical Character Recognition and manual entry, data cleaning and transformation, implementation of software for text analysis, and interpretation of results. For this reason, we hope this study can serve as a model for other low-resource languages that are just beginning to approach the problem of providing text analytics for their language.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: ACM Transactions on Asian and Low-Resource Language Information Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.