Abstract

Contextualized language models are becoming omnipresent in the field of Natural Language Processing (NLP). Their representation learning capabilities deliver dominant results on almost all downstream NLP tasks. The main challenge that low-resource languages face is the lack of language-specific language models, since the pre-training process requires high computing capabilities and rich resources of textual data. This paper describes our efforts to pre-train the first contextualized language model for the Macedonian language (MACEDONIZER), trained on a 6.5 GB corpus of Macedonian texts crawled from public web domains and Wikipedia. Next, we evaluate the pre-trained model on three different downstream tasks: Sentiment Analysis (SA), Natural Language Inference (NLI), and Named Entity Recognition (NER). The evaluation results are compared to the cross-lingual version of the RoBERTa model, XLM-RoBERTa. The results show that MACEDONIZER achieves state-of-the-art results on all downstream tasks. Finally, the pre-trained MACEDONIZER is made freely available for use and further task-specific fine-tuning via HuggingFace.

Keywords: MACEDONIZER, Contextualized language model, Macedonian, Sentiment analysis, Natural Language Inference, Named Entity Recognition, Pre-training
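Since the abstract states that the model is released via HuggingFace for further task-specific fine-tuning, the following is a minimal sketch, assuming the transformers library, of how such a checkpoint would typically be loaded and adapted for one of the evaluated tasks, e.g. sentiment analysis. The model identifier used below is an assumption for illustration only; the actual name of the released checkpoint should be taken from the authors' HuggingFace page.

# Minimal sketch: load a pre-trained RoBERTa-style checkpoint from the
# HuggingFace Hub and attach a fresh classification head for fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "macedonizer/mk-roberta-base"  # assumed identifier, not confirmed by the paper

# Load the tokenizer and the encoder with a new classification head
# (e.g. 2 labels for binary sentiment analysis).
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Tokenize a Macedonian sentence and run a forward pass. Before fine-tuning,
# the classification head is randomly initialized, so the logits are not yet meaningful.
inputs = tokenizer("Овој филм е одличен!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])

From here, the model would be fine-tuned on a labeled Macedonian dataset for the chosen task (SA, NLI, or NER), following the standard transformers training workflow.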


