Abstract

In supervised machine learning, achieving good classification performance on the minority classes of an imbalanced dataset is a major challenge. In such a situation, the model is biased toward the majority classes, which results in poor performance on the remaining classes. This paper examines how the Synthetic Minority Over-Sampling Technique (SMOTE) helps with multinomial text classification on an imbalanced dataset. The performance of SMOTE was examined with the Naive Bayes (NB) and Extreme Gradient Boosting (XGBoost) algorithms. A total of 701 questions related to lifestyle problems were collected from college students residing in Mumbai. Results showed that XGBoost with SMOTE (XGBoost + SMOTE) performed better on the imbalanced dataset than NB with SMOTE (NB + SMOTE), NB without SMOTE (NB-SMOTE), and XGBoost without SMOTE (XGBoost-SMOTE). The average classification accuracy for Naive Bayes (with and without SMOTE) was 68.0%, while the average accuracy for XGBoost was 71.0%. When choosing between XGBoost and NB, researchers can opt for XGBoost with SMOTE to work on a multinomial imbalanced dataset.

Keywords: XGBoost · Imbalanced data · SMOTE · Naïve Bayes · TfidfVectorizer
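The paper does not publish its code, but the pipeline it describes (TF-IDF features, SMOTE oversampling, an XGBoost classifier) can be sketched as below. This is a minimal sketch assuming scikit-learn, imbalanced-learn, and xgboost; the toy texts, labels, and hyperparameters (e.g. k_neighbors=3) are illustrative assumptions, not the study's data or settings.

```python
# Minimal sketch of a TF-IDF + SMOTE + XGBoost pipeline like the one described.
# Toy texts/labels and all hyperparameters are illustrative assumptions; the
# paper's own dataset (701 student questions) is not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Hypothetical imbalanced multinomial data: three lifestyle-question classes.
texts = (
    ["How do I manage exam stress?"] * 12           # class 0 (majority)
    + ["Why can't I sleep properly at night?"] * 8  # class 1
    + ["What should I eat to stay healthy?"] * 6    # class 2 (minority)
)
labels = [0] * 12 + [1] * 8 + [2] * 6

# Turn raw questions into TF-IDF feature vectors.
X = TfidfVectorizer().fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

# Oversample the minority classes on the training split only, so the
# test set keeps the original class distribution.
X_res, y_res = SMOTE(k_neighbors=3, random_state=42).fit_resample(X_train, y_train)

# XGBoost infers the multiclass objective from the integer labels.
clf = XGBClassifier(eval_metric="mlogloss", random_state=42)
clf.fit(X_res, y_res)

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Swapping XGBClassifier for sklearn.naive_bayes.MultinomialNB would give the NB variants, and dropping the SMOTE step would give the without-SMOTE baselines compared in the abstract.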
