Sentiment analysis of imbalanced Arabic data using sampling techniques and classification algorithms

Manar A Jaradat, Marwah Alian, Maisa J Al-Khazaleh

doi:10.11591/eei.v13i1.5886

Abstract

Sentiment analysis is a popular natural language processing task that identifies the opinions or feelings expressed in a piece of text. Microblogging platforms such as Twitter are a valuable resource for finding people’s opinions. Most Arabic sentiment analysis studies report that the data used to train machine learning algorithms is balanced. In this paper, we investigated the impact of sampling techniques and classification algorithms on an imbalanced Arabic dataset about people’s perceptions of COVID-19, in which the majority of opinions reflect fear and stress about the pandemic and the minority reflect the belief that the pandemic was a hoax. The experiments analyzed imbalanced learning of Arabic sentiments by applying over-sampling and under-sampling techniques to seven single machine learning algorithms and two common ensemble algorithms from the bagging and boosting families, respectively. The results show that resampling-based approaches can overcome the difficulty of an imbalanced dataset, and that over-sampled data leads to better performance than under-sampled data. The results also reveal that using over-sampled data from the synthetic minority over-sampling technique (SMOTE), borderline-SMOTE, or adaptive synthetic sampling with a random forest classifier is the most effective approach to this classification problem, with an F1-score of 0.99.
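As a rough illustration of the over-sampling idea behind SMOTE mentioned in the abstract, the following sketch creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, pure-Python stand-in and not the authors' implementation: the function name `smote_sketch`, the parameter `k`, and the toy feature vectors are all assumptions made for this example. In practice one would more likely rely on an established library such as imbalanced-learn.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Hypothetical helper for illustration only; real SMOTE implementations
    handle edge cases and sampling ratios that are omitted here.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbours and a random position
        # on the line segment between base and that neighbour.
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Made-up 2-D feature vectors standing in for vectorised tweets (assumption).
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]

# Generate enough synthetic points to match the majority class size.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Because each synthetic point lies on a segment between two existing minority points, the minority class is enlarged without duplicating samples exactly, which is the main difference from plain random over-sampling.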
