Undersampling and K-Fold Random Forest for Imbalanced Class Classification

Laila Qadrini

doi:10.47065/bits.v4i4.3141

Abstract

Classification in data mining is a modelling process that describes and distinguishes data classes in order to predict the class of an object whose class is unknown. Because classification applies to many domains, a large number of classification algorithms have been developed over time, but one problem is encountered often: class imbalance. Class imbalance is a condition in which the number of observations per class is unequal, or differs substantially from one class to another. Most classification datasets do not have the same number of observations in every class, and imbalance is not a problem when the class proportions differ only slightly. When the difference is large and left untreated, however, the resulting model's predictions tend toward the majority class, so the minority class contributes little to the model. Resampling is one of the approaches most often used to handle imbalanced classes. The purpose of this research is to apply the Undersampling Random Forest and K-Fold Undersampling Random Forest algorithms to the Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository. Undersampling was chosen because it produces better accuracy than oversampling. The K-Fold (k = 10) Random Forest algorithm achieved a recall of 83%, and the Undersampling Random Forest algorithm achieved a recall of 65%.
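As an illustration of the resampling idea described in the abstract, the following is a minimal Python sketch of random undersampling together with a recall calculation. It is not the author's implementation: the function names `random_undersample` and `recall`, the seed, and the toy data (7 "benign" and 3 "malignant" rows) are assumptions made up for this example, and no Random Forest is trained here. In a K-fold setting, undersampling would typically be applied only to each training fold so that the held-out fold keeps its original class proportions.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # The smallest class sets the target size for every class.
    target = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement; the minority class is kept whole.
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def recall(y_true, y_pred, positive):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 7 majority-class rows and 3 minority-class rows.
rows = [[i] for i in range(10)]
labels = ["benign"] * 7 + ["malignant"] * 3

bal_rows, bal_labels = random_undersample(rows, labels)
print(Counter(bal_labels))
```

After undersampling, both classes contain three rows, so a classifier trained on `bal_rows` no longer sees a majority class. The cost is that four majority-class rows are discarded, which is the usual trade-off of undersampling on small datasets.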
