TiBERT: Tibetan Pre-trained Language Model

Sisi Liu, Yuan Sun, Xiaobing Zhao, Junjie Deng

doi:10.1109/smc53654.2022.9945074

Abstract

Pre-trained language models are trained on large-scale unlabeled text and can achieve state-of-the-art results on many downstream tasks. However, current pre-trained language models mainly target Chinese and English. For low-resource languages such as Tibetan, a monolingual pre-trained model is lacking. To promote the development of Tibetan natural language processing, this paper collects large-scale training data from Tibetan websites and uses SentencePiece to construct a vocabulary that covers 99.95% of the words in the corpus. We then train a Tibetan monolingual pre-trained language model, named TiBERT, on this data and vocabulary. Finally, we apply TiBERT to the downstream tasks of text classification and question generation and compare it with classic models and multilingual pre-trained models. The experimental results show that TiBERT achieves the best performance. Our model is published at http://tibert.cmli-nlp.con
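
To make the vocabulary coverage figure concrete, the following is a minimal sketch of how the share of word occurrences covered by a vocabulary could be measured. It is not the authors' procedure and does not use SentencePiece: the function names `coverage` and `smallest_vocab_for_coverage`, the toy corpus, and the whitespace tokenization are all assumptions made for illustration only.

```python
from collections import Counter


def coverage(vocab, tokens):
    """Fraction of token occurrences in `tokens` that appear in `vocab`."""
    if not tokens:
        return 0.0
    covered = sum(1 for tok in tokens if tok in vocab)
    return covered / len(tokens)


def smallest_vocab_for_coverage(tokens, target):
    """Add tokens from most to least frequent until `target` coverage is reached."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab, covered = set(), 0
    for tok, n in counts.most_common():
        if total and covered / total >= target:
            break
        vocab.add(tok)
        covered += n
    return vocab


# Toy corpus (hypothetical data, whitespace-tokenized for simplicity;
# real Tibetan text would need a proper segmenter or subword model).
toy_tokens = "a a a a b b b c c d".split()
vocab = smallest_vocab_for_coverage(toy_tokens, target=0.9)
print(sorted(vocab), coverage(vocab, toy_tokens))
```

With a real corpus, the same measurement with `target=0.9995` would correspond to the kind of coverage figure quoted in the abstract, assuming coverage there is counted over word occurrences.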
