Abstract

Tibetan text classification is a fundamental task in Tibetan natural language processing. Fine-tuning a large-scale pre-trained language model is the current mainstream approach to text classification. However, Tibetan lacks both open-source large-scale text corpora and pre-trained language models. To address these problems, this paper crawls a large-scale Tibetan text dataset and uses this corpus to train a Tibetan pre-trained language model (bert-base-tibetan). Experimental results across a variety of neural-network-based text classification models built on this model show that the pre-trained language model significantly improves the performance of Tibetan text classification (F1 score increases by 9.3% on average), which verifies the value of the Tibetan pre-trained language model for Tibetan text classification and other related Tibetan natural language processing tasks.
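
To make the described pipeline concrete, the sketch below shows how such a pre-trained checkpoint could be fine-tuned or applied to Tibetan text classification with the Hugging Face transformers library. This is a minimal illustration, not the authors' released code: the checkpoint identifier "bert-base-tibetan" matches the name in the abstract, but the loading path, label count, and sample text are assumptions.

```python
# Minimal sketch: loading a Tibetan BERT checkpoint for sequence
# classification. "bert-base-tibetan" is used here as a hypothetical
# local path or hub ID; the number of labels is also illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "bert-base-tibetan"  # hypothetical checkpoint location
NUM_LABELS = 12                   # hypothetical number of text categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH, num_labels=NUM_LABELS
)

# Tokenize a Tibetan input and predict its class.
inputs = tokenizer(
    "བོད་ཡིག",  # placeholder Tibetan text
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```

In a full fine-tuning run, the classification head on top of the pre-trained encoder would be trained on labeled Tibetan text, which is the setup under which the abstract reports the average F1 improvement.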
