DzNER: A large Algerian Named Entity Recognition dataset

Mohamed Amine Cheragui, Abdelhalim Hafedh Dahou

doi:10.1016/j.nlp.2023.100005

Open Access

https://doi.org/10.1016/j.nlp.2023.100005

Abstract

Named Entity Recognition (NER) is a natural language processing (NLP) task that assigns labels such as Person, Location, and Organization to words in text. While a substantial amount of annotated NER data exists for English and other European languages, this is not the case for Arabic and its dialects. This paper introduces DzNER, an Algerian NER dataset of more than 21,000 manually annotated sentences (over 220,000 tokens) collected from Algerian Facebook pages and YouTube channels, with a focus on three prominent classes. We provide a detailed analysis of the NER tag-set used in the dataset and show that it offers a good balance of quantity, diversity, and coverage of different domains. To demonstrate the effectiveness of the resource, we apply various language models to the NER sequence labeling task and compare the results with those obtained on existing datasets. To the best of our knowledge, no currently available dataset matches DzNER in both variability and volume. We hope that this dataset and the accompanying code and models will be useful for further research on NLP for the Algerian dialect and will help address the scarcity of resources for it.

Full Text