Anti-Islamic Arabic Text Categorization using Text Mining and Sentiment Analysis Techniques

Rawan Abdullah Alraddadi, Moulay Ibrahim El-Khalil Ghembaza

doi:10.14569/ijacsa.2021.0120889

Open Access

https://doi.org/10.14569/ijacsa.2021.0120889

Abstract

The aim of this research is to detect and classify websites whose content encourages hate speech toward Islam and Muslims (Islamophobia), using sentiment analysis and web text mining techniques. A large corpus was collected to identify and classify anti-Islamic online content. Our target is to automatically detect the content of websites that are hostile to Islam and that transmit extremist ideas against it. The main purpose is to reduce the spread of webpages that give a wrong idea about Islam. The data were collected from different sources, and two Arabic-language datasets (balanced and unbalanced) were produced. The framework of the proposed approach is described. It is based on a supervised Machine Learning (ML) approach that uses Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB) models as classifiers, with Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction. Different experiments, at word level and tri-gram level, were conducted on the two datasets, and the results were compared. The experimental results show that the supervised ML approach at word level performs best on both datasets, reaching an accuracy of 97% on the balanced Arabic dataset with the SVM algorithm and TF-IDF features. Finally, an interactive web-application prototype was developed to detect and classify toxic language such as anti-Islamic online text content.
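The feature-extraction step named in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, word-level tri-grams (the abstract does not say whether tri-grams are built from words or characters), and one common TF-IDF variant, idf = ln(N / df) + 1; the paper's exact weighting and preprocessing are not stated here. The function names `ngrams` and `tfidf` and the placeholder documents are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Split on whitespace and return word n-grams (n=1 gives single words)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = count / number of terms in the document,
    idf = ln(N / df) + 1. This is one common variant, not necessarily
    the one used in the paper.
    """
    term_lists = [ngrams(doc, n) for doc in docs]

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))

    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for short documents
        vectors.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Placeholder documents (hypothetical; a real run would use the Arabic corpus).
docs = [
    "placeholder hostile text sample one",
    "placeholder neutral text sample two",
    "another neutral text sample",
]

word_vectors = tfidf(docs, n=1)     # word-level features
trigram_vectors = tfidf(docs, n=3)  # tri-gram-level features
```

In this sketch a term that occurs in only one document, such as "hostile", receives a higher weight than a term that occurs in every document, such as "text". The resulting sparse vectors are the kind of input a classifier such as SVM or MNB would then be trained on.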

Highlights

Islamophobia has escalated in the past decades, and this has shown in the real world as well as on online websites
When Term Frequency-Inverse Document Frequency (TF-IDF) is used, the True Positive (TP) count, which is the number of correct predictions, is 6161 for the Machine Learning (ML) model at word level on the non-balanced Arabic dataset, and 1573 at trigram level on the same dataset
The results demonstrated that the Support Vector Machines (SVM) classifier was the best in terms of accuracy, and it outperformed the Multinomial Naive Bayes (MNB) classifier in almost all experiments
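
The TP and accuracy figures mentioned above come from comparing predicted labels with true labels. The sketch below shows one way such counts could be computed, assuming a binary task in which label 1 marks the positive class and using the standard confusion-matrix convention, where TP counts positive items that were predicted correctly. The labels are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Placeholder labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
```

Comparing two classifiers, as the highlights do for SVM and MNB, would amount to computing these counts for each model on the same test split.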

Summary

Introduction

Islamophobia has escalated in the past decades, and this has shown in the real world as well as on online websites. This problem has affected Muslim communities, especially those living in non-Muslim foreign countries or in any place where extremists hold anti-Islam ideas. It has now moved to the Internet, where webpages are created to attack the Muslim faith. There is a lack of research that addresses the problems of classification in the Arabic language, especially the classification of this type of text. This motivates us to explore methods and techniques for dealing with this issue. The authors worked only with English content, not Arabic content, whereas our objective covers both languages: Arabic in this paper and English in our previous paper [2].

Objectives

Results

Conclusion