Abstract
This paper presents our study of the problem of unbalanced datasets in supervised sentiment classification in an Arabic context. We propose three different methods for under-sampling the majority-class documents. Our goal is to compare the effectiveness of the proposed methods with that of common random under-sampling. We also aim to evaluate the behaviour of the classifiers under different under-sampling rates. We use three common classifiers, namely Naive Bayes, support vector machines and k-nearest neighbours. The experiments are carried out on two Arabic datasets that we built in-house. We show that the results obtained on the first dataset, which is slightly skewed, are better than those obtained on the second, which is highly skewed. We also conclude that Naive Bayes is sensitive to dataset size: the more we reduce the data, the more the results degrade. Support vector machines, by contrast, are highly sensitive to unbalanced datasets. We observe unstable behaviour from k-nearest neighbours. The results also show that the proposed techniques are reliable and typically competitive with random under-sampling.
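For readers unfamiliar with the baseline, the following is a minimal sketch of random under-sampling at a configurable rate. It is not the authors' implementation: the function name `random_undersample`, the `seed` parameter, and the definition of `rate` as the fraction of the majority-class surplus to discard (0.0 keeps everything, 1.0 balances the majority class against the next largest class) are assumptions made for illustration only.

```python
import random
from collections import Counter


def random_undersample(docs, labels, rate, seed=0):
    """Randomly discard majority-class documents (hypothetical helper).

    `rate` is assumed to be the fraction of the majority-class surplus
    (majority count minus next-largest class count) that is removed.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    counts = Counter(labels).most_common()
    if len(counts) < 2:
        # Only one class present: nothing to balance against.
        return list(docs), list(labels)
    (majority, n_major), (_, n_next) = counts[0], counts[1]
    n_keep = n_major - round(rate * (n_major - n_next))
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    # Sample without replacement the majority documents to keep.
    kept = set(random.Random(seed).sample(major_idx, n_keep))
    selected = [i for i, y in enumerate(labels) if y != majority or i in kept]
    return [docs[i] for i in selected], [labels[i] for i in selected]


# Toy data: 8 positive and 2 negative documents.
docs = [f"doc{i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2
docs_half, labels_half = random_undersample(docs, labels, rate=0.5)
docs_bal, labels_bal = random_undersample(docs, labels, rate=1.0)
```

Under this assumed definition, sweeping `rate` from 0 to 1 moves the training set from its original skew to a fully balanced one, which is one way to probe how a classifier reacts to different under-sampling rates.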
More From: International Journal of Information and Communication Technology