Abstract

Contextualized language models are becoming omnipresent in the field of Natural Language Processing (NLP). Their representation learning capabilities deliver dominant results on almost all downstream NLP tasks. The main challenge that low-resource languages face is the lack of language-specific language models, since the pre-training process requires high computing capabilities and rich resources of textual data. This paper describes our efforts to pre-train the first contextualized language model for the Macedonian language (MACEDONIZER), trained on a 6.5 GB corpus of Macedonian texts crawled from public web domains and Wikipedia. Next, we evaluate the pre-trained model on three different downstream tasks: Sentiment Analysis (SA), Natural Language Inference (NLI), and Named Entity Recognition (NER). The evaluation results are compared to the cross-lingual version of the RoBERTa model, XLM-RoBERTa. The results show that MACEDONIZER achieves state-of-the-art results on all downstream tasks. Finally, the pre-trained MACEDONIZER is made freely available for use and further task-specific fine-tuning via HuggingFace.

Keywords: MACEDONIZER, Contextualized language model, Macedonian, Sentiment analysis, Natural Language Inference, Named Entity Recognition, Pre-training
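Since the abstract states that the model is released via HuggingFace for further task-specific fine-tuning, the following is a minimal sketch, assuming the transformers library, of how such a checkpoint would typically be loaded and adapted for one of the evaluated tasks, e.g. sentiment analysis. The model identifier used below is an assumption for illustration only; the actual name of the released checkpoint should be taken from the authors' HuggingFace page.

# Minimal sketch: load a pre-trained RoBERTa-style checkpoint from the
# HuggingFace Hub and attach a fresh classification head for fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "macedonizer/mk-roberta-base"  # assumed identifier, not confirmed by the paper

# Load the tokenizer and the encoder with a new classification head
# (e.g. 2 labels for binary sentiment analysis).
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Tokenize a Macedonian sentence and run a forward pass. Before fine-tuning,
# the classification head is randomly initialized, so the logits are not yet meaningful.
inputs = tokenizer("Овој филм е одличен!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])

From here, the model would be fine-tuned on a labeled Macedonian dataset for the chosen task (SA, NLI, or NER), following the standard transformers training workflow.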


