Abstract
This paper presents a streamlined taxonomy for categorizing offensive language in Arabic, specifically Modern Standard Arabic (MSA) and the Levantine dialect. Addressing a gap in the existing literature, which has focused mainly on Indo-European languages, our taxonomy divides offensive language into seven levels (six explicit and one implicit). We adapted our framework from the simplified offensive language (SOL) taxonomy of Lewandowska-Tomaszczyk et al. (Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Anna Bączkowska, Chaya Liebeskind, Jelena Mitrovic & Giedre Valunaite Oleškeviciente. 2021a. LOD-connected offensive language ontology and tagset enrichment. In Shubert R. Carvalho & Renato R. Souza (eds.), Proceedings of the workshops and tutorials held at LDK 2021 co-located with the 3rd Language, Data and Knowledge conference, Vol. 3064, 135–150. CEUR Workshop Proceedings), customizing it to reflect the linguistic and cultural nuances of Arabic. To validate this taxonomy, we created a new dataset drawn from several social media platforms, primarily Twitter. The dataset was curated by human annotators and is described in detail in the paper; it serves both as a validation tool for our taxonomy and as a foundation for future research on offensive language detection in Arabic. Initial analysis of the dataset reveals complex patterns of offensive expression in MSA and Levantine Arabic, underscoring the need to account for linguistic and cultural variation when studying online abuse. Our taxonomy and dataset support research in Arabic sociocultural studies, natural language processing, and linguistic analysis, and contribute to the study of low-resource languages.