Abstract

The fight against offensive speech on the Internet necessitates increased efforts from linguistic analysis and artificial intelligence perspectives to develop countermeasures and preventive methods. Reliable predictions can only be obtained if these methods are exposed to a representative sample of the domain or environment under consideration. Datasets are the foundation for significant developments in this field because they are the main means of obtaining appropriate instances that reveal the many and varied faces of the offensive speech phenomenon. In this sense, we present Ar-PuFi, a dataset of offensive speech towards Public Figures in the Arab community. With 24,071 comments collected from TV interviews with Egyptian celebrities across six domains of public interest, Ar-PuFi is currently the largest Arabic dataset of its kind. The examples were annotated by three native speakers over the course of two months and are provided with both two-class and six-class labels based on the presence or absence of explicit and implicit offensive content. We evaluated the performance of a diverse set of classification models employing several text representations (e.g., N-gram, TF/IDF, AraVec, and fastText), and AraBERT established the baseline for the new dataset in both offensive detection and group classification. Additionally, we apply the Pointwise Mutual Information (PMI) technique to comments within the target domain in order to derive a lexicon of offensive terms associated with each domain of Ar-PuFi. We further explored whether active learning (AL) or meta-learning (ML) frameworks could reduce the labeling effort required for our dataset without affecting prediction quality and found that, although AL can reduce the amount of annotation by 10% relative to the ML approach, neither approach requires less than about 70% of the full dataset to achieve baseline performance.
Finally, we took advantage of the availability of relevant datasets and conducted a cross-domain experiment to back up our claims not only about the uniqueness of our dataset but also about the difficulty of transferring models across Arabic dialects.
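The PMI-based lexicon derivation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace-tokenized comments with binary offensive/non-offensive labels, and scores each term t against the offensive label c as PMI(t, c) = log2(P(t, c) / (P(t) · P(c))), keeping the highest-scoring terms as the domain lexicon. The function name `pmi_lexicon` and all parameters are hypothetical.

```python
import math
from collections import Counter

def pmi_lexicon(comments, labels, target_label, top_k=10):
    """Rank terms by pointwise mutual information with `target_label`.

    PMI(term, label) = log2( P(term, label) / (P(term) * P(label)) ),
    with probabilities estimated as document frequencies.
    """
    n_docs = len(comments)
    term_counts = Counter()   # documents containing each term
    joint_counts = Counter()  # documents with the term AND the target label
    label_count = 0           # documents carrying the target label

    for text, label in zip(comments, labels):
        terms = set(text.split())  # count each term once per document
        term_counts.update(terms)
        if label == target_label:
            label_count += 1
            joint_counts.update(terms)

    p_label = label_count / n_docs
    scores = {}
    for term, joint in joint_counts.items():
        p_term = term_counts[term] / n_docs
        p_joint = joint / n_docs
        scores[term] = math.log2(p_joint / (p_term * p_label))

    # Highest-PMI terms are the candidate offensive lexicon for this domain.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that occur mostly in offensive comments receive high positive PMI, while terms spread evenly across both classes score near zero; in practice a minimum-frequency cutoff is usually added to suppress rare-term noise.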
