Abstract

Tibetan text classification is a fundamental task in Tibetan natural language processing. Fine-tuning a large-scale pre-trained language model is the current mainstream approach to text classification. However, Tibetan lacks both open-source large-scale text corpora and pre-trained language models. To address these problems, this paper crawls a large-scale Tibetan text dataset and uses this corpus to train a Tibetan pre-trained language model (bert-base-tibetan). Experimental results across a variety of neural-network-based text classification models built on this model show that the pre-trained language model significantly improves the performance of Tibetan text classification (F1 score increases by 9.3% on average), which verifies the value of the Tibetan pre-trained language model for Tibetan text classification and other related Tibetan natural language processing tasks.
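
To make the described pipeline concrete, the sketch below shows how such a pre-trained checkpoint could be fine-tuned or applied to Tibetan text classification with the Hugging Face transformers library. This is a minimal illustration, not the authors' released code: the checkpoint identifier "bert-base-tibetan" matches the name in the abstract, but the loading path, label count, and sample text are assumptions.

```python
# Minimal sketch: loading a Tibetan BERT checkpoint for sequence
# classification. "bert-base-tibetan" is used here as a hypothetical
# local path or hub ID; the number of labels is also illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "bert-base-tibetan"  # hypothetical checkpoint location
NUM_LABELS = 12                   # hypothetical number of text categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH, num_labels=NUM_LABELS
)

# Tokenize a Tibetan input and predict its class.
inputs = tokenizer(
    "བོད་ཡིག",  # placeholder Tibetan text
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```

In a full fine-tuning run, the classification head on top of the pre-trained encoder would be trained on labeled Tibetan text, which is the setup under which the abstract reports the average F1 improvement.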
