Abstract

We are witnessing an increasing proliferation of hate speech on social media targeting individuals for their protected characteristics. Our study aims to devise an effective Arabic hate and offensive speech detection framework to address this serious issue. First, we build a reliable Arabic textual corpus by crawling data from Twitter using four robust extraction strategies that we implement based on four types of hate: religion, ethnicity, nationality, and gender. Next, we label the corpus using a three-level hierarchical annotation scheme and verify the inter-annotator agreement to ensure a reliable ground truth at each level. Using machine learning and deep learning techniques, we develop numerous two-class, three-class, and six-class classification models that we combine with a variety of feature extraction techniques, such as contextual word embeddings. Finally, we conduct extensive experiments to assess the performance of the learned models and to examine their misclassification errors. The results are very encouraging compared to prior hate and offensive speech studies carried out on Arabic and other languages.
