Comparing Different Resampling Methods in Predicting Students’ Performance Using Machine Learning Techniques

Ramin Ghorbani, Rouzbeh Ghousi

doi:10.1109/access.2020.2986809

Abstract

With the advancement of technology, predicting students' performance has become one of the most beneficial and essential research topics. Data mining is extremely helpful in education, especially for analyzing students' performance. Predicting students' performance is a serious challenge because datasets in this field are imbalanced, and the different resampling methods have not yet been compared for this task. This paper compares several resampling techniques, namely Borderline SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek, for handling the imbalanced data problem when predicting students' performance on two different datasets. It also examines the difference between multiclass and binary classification and the effect of the features' structure. To assess how well the resampling methods solve the imbalance problem, the paper uses a range of machine learning classifiers: Random Forest, K-Nearest-Neighbor, Artificial Neural Network, XGBoost, Support Vector Machine (Radial Basis Function), Decision Tree, Logistic Regression, and Naive Bayes. Random hold-out and shuffle 5-fold cross-validation are used as model validation techniques. Results across several evaluation metrics indicate that models perform better with fewer classes and with nominal features. Classifiers do not perform well on imbalanced data, so the problem must be addressed, and their performance improves on balanced datasets. Additionally, the results of the Friedman test, a statistical significance test, confirm that SVM-SMOTE is more efficient than the other resampling methods. The Random Forest classifier achieved the best result among all models when SVM-SMOTE was used as the resampling method.
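The Friedman test mentioned in the abstract ranks the competing methods within each block (for example, each classifier or dataset) and checks whether the rank sums differ more than chance would allow. The following is a minimal sketch of the standard statistic without tie correction, not the authors' implementation; the function names and the example scores are hypothetical and chosen only for illustration.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """scores[b][m] is the score of method m in block b.

    Returns (chi-square statistic, mean rank per method), where rank 1
    is the best (highest) score. No tie correction is applied.
    """
    n = len(scores)          # number of blocks
    k = len(scores[0])       # number of methods being compared
    rank_sums = [0.0] * k
    for row in scores:
        # Negate so that the highest score receives rank 1.
        for m, r in enumerate(average_ranks([-s for s in row])):
            rank_sums[m] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Hypothetical accuracies: 4 blocks (rows) x 3 resampling methods (columns).
example_scores = [
    [0.70, 0.80, 0.90],
    [0.60, 0.75, 0.85],
    [0.65, 0.70, 0.95],
    [0.72, 0.78, 0.88],
]
chi2, mean_ranks = friedman_statistic(example_scores)
print(chi2, mean_ranks)
```

Under the null hypothesis the statistic is approximately chi-square distributed with k - 1 degrees of freedom, so a large value suggests that at least one method ranks consistently differently from the others.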

Highlights

Recent advances in several fields have led to a large amount of collected data [1].
This paper shows the effect of the imbalanced data problem and handles it using various resampling methods. Its aims include determining the best resampling method and the best classifier compared with all other models, examining the difference between multiclass and binary classification, and assessing the importance of the features’ structure.
This study intends to show the effect of the imbalanced data problem and to find the best method for handling it among Borderline SMOTE, Random Over Sampler, SMOTE, Support Vector Machine (SVM)-SMOTE, SMOTE-Edited Nearest Neighbors (ENN), and SMOTE-Tomek.
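To make the idea behind two of the listed methods concrete, the sketch below shows random oversampling (duplicating minority rows) and a SMOTE-style step (interpolating between a minority row and one of its nearest minority neighbours). It is a simplified illustration in plain Python that assumes numeric features; the function names, parameters, and toy data are hypothetical, and it is not the implementation used in the paper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point on the segment between a random minority
    row and one of its k nearest minority neighbours (needs >= 2 rows)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority rows by squared Euclidean distance to the base row.
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, other)]


# Toy data: class 0 has four rows, class 1 has two.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0], [11.0, 11.0]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
print(smote_like_sample([[10.0, 10.0], [11.0, 11.0], [12.0, 10.0]]))
```

Random oversampling only repeats existing rows, whereas the SMOTE-style step creates new points inside the region spanned by the minority class. Variants such as Borderline SMOTE and SVM-SMOTE differ mainly in which minority rows they choose as the base for interpolation.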

Summary

Introduction

Recent advances in several fields have led to a large amount of collected data [1]. Because analyzing such volumes of data manually to extract useful information is tedious, data mining techniques can be used to discover valuable and significant knowledge from the data [2]. Universities are known to operate in a very complex and highly competitive environment [3], [4]. Their main challenge is to examine their own performance in depth, identify what makes them unique, and build strategies for further development and future achievements [5].

Objectives

Methods

Results

Conclusion