Research of data mining methods for classification of imbalanced data sets

A V Doroshenko, D Y Savchuk

doi:10.23939/ujit2024.01.048

Abstract

Information technology is now used in nearly every sphere of human activity, and extremely large amounts of data have accumulated as a result. Applying machine learning methods to this data can yield new, practically useful knowledge. The main goal of this paper is to study different machine learning methods for solving the classification problem and to compare their efficiency and accuracy. A separate task is data pre-processing, which addresses class imbalance in the sample and identifies the principal components to be used for classification. For this purpose, an information system that classifies a company as bankrupt or not, given specified economic and financial characteristics, was researched and developed. The study uses a dataset on which the efficiency and quality of several existing classification algorithms are evaluated: conventional and linear Support Vector Machine, Extra Trees, Random Forest, Decision Tree, Logistic Regression, Multilayer Perceptron classifier, Gradient Boosting, and Naive Bayes classifier. For data pre-processing, we scaled the data, applied the SMOTE method to remove the imbalance of the training sample, and performed principal component analysis and L1 regularisation. Principal component analysis allowed us to identify the 15 principal components with the greatest impact on classification accuracy and to use them in the classification process. Analysing the results, we found that the best classifier was Random Forest, with 95.9 % accuracy, and the worst was Naive Bayes, with 85.1 %.
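The oversampling idea behind SMOTE can be illustrated with a minimal sketch. It is not the implementation used in the study: the function name `smote_oversample`, its parameters, and the toy points are assumptions made for illustration, and in practice a library implementation (for example the `SMOTE` class in imbalanced-learn) would normally be preferred. Each synthetic point is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built by SMOTE-style interpolation.

    minority: list of equal-length numeric tuples (minority-class samples).
    Illustrative helper, not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours, excluding the base itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment
        # between the base point and the chosen neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: four minority samples with two features each.
minority = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0), (2.0, 2.0)]
synthetic = smote_oversample(minority, n_new=6, k=2, seed=1)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already covered by the minority class. As in the pre-processing described above, oversampling of this kind is applied to the training sample only, so that the test sample keeps its original class distribution.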
To evaluate the quality of classification and select the best classifier, we use the confusion matrix, which records the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) classification results, together with the metrics accuracy, precision, recall (sensitivity), F1, and ROC AUC. Accuracy is the share of correct predictions among all predictions, (TP + TN) / (TP + TN + FP + FN). Precision is the number of TPs divided by the number of TPs plus the number of FPs. Recall is the number of TPs divided by the number of TPs plus the number of FNs. F1 is the harmonic mean of precision and recall and indicates the balance between them. ROC AUC measures classification performance across different decision thresholds and shows how well a model can distinguish between classes. The conclusions present the main results of the study and indicate the main direction of future work, namely the study of classification results on other datasets and more efficient data processing and analysis.
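The metrics named above can be computed directly from confusion-matrix counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The following is a minimal sketch; the function names `classification_metrics` and `roc_auc`, and all counts and scores in the example, are hypothetical and are not results from the study. ROC AUC is computed here through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(scores_pos, scores_neg):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical imbalanced test set: 80 positive and 920 negative samples.
metrics = classification_metrics(tp=40, tn=900, fp=20, fn=40)

# Hypothetical classifier scores for two positive and two negative samples.
auc = roc_auc([0.9, 0.4], [0.5, 0.1])
```

With these made-up counts the accuracy is 0.94 while the recall is only 0.5, which illustrates why accuracy alone can be misleading on an imbalanced sample and why precision, recall, F1, and ROC AUC are reported alongside it.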
