Abstract

Pre-trained language models (PLMs) are driving much of the recent progress in natural language processing. Due to the resource-intensive nature of these models, however, under-represented languages without sizable curated data have not seen significant progress. Multilingual PLMs have been introduced with the potential to generalize across many languages, but their performance fluctuates depending on the target language and trails that of their monolingual counterparts. In the case of the Tigrinya language, recent studies report low performance when applying current multilingual models. We attribute this to its orthography and linguistic properties, which differ markedly from those of the Indo-European and other typologically distant languages used to train the models. In this work, we pre-train three monolingual PLMs for Tigrinya on a corpus that we compiled from news sources, and we compare the models with their multilingual counterparts on two downstream tasks, part-of-speech tagging and sentiment analysis, achieving significantly better results and establishing a new state of the art.
