New under-sampling methods to address the problem of unbalanced sentiment classification: application on Arabic datasets

Asmaa Mountassir, Ilham Berrada, Houda Benbrahim

doi:10.1504/ijict.2016.077687

Abstract

This paper presents a study of the problem of unbalanced datasets in supervised sentiment classification in an Arabic context. We propose three different methods to under-sample the majority class documents. Our goal is to compare the effectiveness of the proposed methods with that of common random under-sampling. We also aim to evaluate the behaviour of the classifiers under different under-sampling rates. We use three common classifiers, namely Naive Bayes, support vector machines and k-nearest neighbours. The experiments are carried out on two Arabic datasets that we built in-house. We show that results obtained on the first dataset, which is slightly skewed, are better than those obtained on the second one, which is highly skewed. We also conclude that Naive Bayes is sensitive to dataset size: the more we reduce the data, the more the results degrade. Support vector machines, in contrast, are highly sensitive to unbalanced datasets. We observe unstable behaviour for k-nearest neighbours. The results also show that the proposed techniques are reliable and typically competitive with random under-sampling.
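For readers unfamiliar with the baseline, the following is a minimal sketch of random under-sampling of the majority class. It is an illustration under stated assumptions and not the implementation used in the paper: the function name `random_under_sample`, the `seed` parameter, and the interpretation of `rate` as the fraction of majority-class documents to discard are all hypothetical. The three proposed methods are not described in the abstract and are not sketched here.

```python
import random
from collections import Counter


def random_under_sample(docs, labels, rate, seed=0):
    """Discard a fraction `rate` of the majority-class documents at random.

    Assumption: `rate` is the share of majority-class documents removed
    (0.0 keeps everything, 1.0 removes the whole majority class).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    if len(docs) != len(labels):
        raise ValueError("docs and labels must have the same length")

    # Identify the majority class and the positions of its documents.
    majority = Counter(labels).most_common(1)[0][0]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]

    # Choose which majority documents to drop, uniformly without replacement.
    n_drop = int(rate * len(majority_idx))
    rng = random.Random(seed)
    dropped = set(rng.sample(majority_idx, n_drop))

    # Keep all minority documents and the surviving majority documents,
    # preserving the original order.
    kept = [i for i in range(len(labels)) if i not in dropped]
    return [docs[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 8 positive and 2 negative documents.
toy_docs = [f"doc{i}" for i in range(10)]
toy_labels = ["pos"] * 8 + ["neg"] * 2
sampled_docs, sampled_labels = random_under_sample(toy_docs, toy_labels, rate=0.5)
```

Sweeping `rate` over several values and retraining a classifier at each setting is one way to examine how a classifier behaves under different under-sampling rates, which is the kind of comparison the abstract describes.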

Full Text