Abstract

This paper proposes a method for Named-Entity Recognition (NER) in Tigrinya, a low-resource language, using a pre-trained language model. Tigrinya is morphologically rich, yet remains underrepresented in NLP, mainly due to the limited amount of annotated data available. To address this problem, we present the first publicly available NER datasets for Tigrinya, in two manually annotated versions (V1 and V2) containing 69,309 and 40,627 tokens, respectively, with annotations based on the CoNLL 2003 Beginning, Inside, and Outside (BIO) tagging scheme. We develop a new pre-trained language model for Tigrinya based on RoBERTa, which we refer to as TigRoBERTa, and fine-tune it on the downstream NER and part-of-speech (POS) tagging tasks with limited data. Finally, we further enhance performance by applying semi-supervised self-training using unlabeled data. Experimental results show that the method achieves an 84% F1-score for NER and 92% accuracy for POS tagging, better than or comparable to a baseline CNN-BiLSTM-CRF model.
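The abstract refers to the CoNLL 2003 BIO tagging scheme. As a minimal illustration (the sentence, entity spans, and helper function below are our own example, not drawn from the paper's datasets), each token receives a tag marking whether it Begins an entity, is Inside one, or is Outside any entity:

```python
# Illustrative BIO-tagged sentence (hypothetical example, not from the datasets).
tokens = ["Asmara", "is", "the", "capital", "of", "Eritrea"]
tags   = ["B-LOC",  "O",  "O",   "O",       "O",  "B-LOC"]

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from a BIO-tagged sentence."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)           # continue the open entity
        else:                             # "O" closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:                           # flush a sentence-final entity
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(tokens, tags))
# → [('Asmara', 'LOC'), ('Eritrea', 'LOC')]
```

NER models trained on such data, including the paper's TigRoBERTa and the CNN-BiLSTM-CRF baseline, predict one BIO tag per token; span-level F1 is then computed over the decoded entities.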
