VALIDATION ASSESSMENTS ON RESAMPLING METHOD IN IMBALANCED BINARY CLASSIFICATION FOR LINEAR DISCRIMINANT ANALYSIS

Ahmad Hakiim Jamaluddin, Nor Idayu Mahat

doi:10.32890/jict.20.1.2021.6358

Abstract

The curse of class imbalance affects the performance of many conventional classification algorithms, including linear discriminant analysis (LDA). Data pre-processing through resampling methods such as random oversampling (ROS) and random undersampling (RUS) is one of the treatments used to alleviate this curse. Previous studies have attempted to address the effect of a resampling method on the performance of LDA. However, some studies contradicted each other based on different performance measures as well as validation strategies. This manuscript attempted to shed more light on the effect of a resampling method (ROS or RUS) on the performance of LDA, measured by true positive rate (TPR) and true negative rate (TNR), through five validation strategies: leave-one-out cross-validation, k-fold cross-validation, repeated k-fold cross-validation, naive bootstrap, and .632+ bootstrap. A total of 100 simulated two-group bivariate normally distributed data sets and four real data sets with severe class imbalance ratios were utilised. The analysis of the location and dispersion statistics of the performance measures shed further light on: (i) the effect of a resampling method on the performance of LDA, and (ii) the enhancement in the learning fairness of LDA on objects regardless of sample size, hence reducing the effect of the curse of class imbalance.

Highlights

Classification algorithms including linear discriminant analysis (LDA) often deal with a data set with groups of similar sizes
The findings suggested that class imbalance affected the performance of LDA negatively and that a resampling method (random oversampling (ROS) or random undersampling (RUS)) improved its performance based on AUC under a 4-fold cross-validation strategy
The findings from the means of TPR and TNR are in line with those of Jamaluddin and Mahat (2019), who showed that the gain in LDA’s performance in classifying minority group objects was relatively more significant than the loss in its performance in classifying majority group objects

Summary

INTRODUCTION

Classification algorithms including linear discriminant analysis (LDA) often deal with a data set with groups of similar sizes (balanced groups). This manuscript attempts to further investigate and verify the novel findings of recent works by applying several validation strategies, namely leave-one-out cross-validation (loocv), k-fold cross-validation (kfcv), repeated k-fold cross-validation (rkfcv), naive bootstrap (B), and .632+ bootstrap (B632), to the performance of LDA with or without a resampling method. The manuscript starts with a thorough discussion of related works, primarily based on three main studies. A separate section describes the methodology of the study.
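The setup can be illustrated with a minimal sketch, written here in Python using only the standard library. It is not the authors' implementation. The group means, group sizes, seeds, and the helper names simulate, resample, fit_lda, and kfold_rates are all assumptions chosen for illustration. The sketch simulates one imbalanced two-group bivariate normal data set, applies ROS or RUS to the training part of each fold only, fits a two-group LDA with pooled covariance, and pools TPR and TNR over k folds.

```python
import math
import random

def simulate(n_major, n_minor, rng):
    """Two-group bivariate normal data; label 1 marks the minority group.
    The means (0, 0) and (2, 2) are illustrative assumptions."""
    major = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(n_major)]
    minor = [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(n_minor)]
    return major + minor

def resample(data, method, rng):
    """Balance the groups by ROS or RUS; any other value leaves sizes unchanged."""
    groups = [[d for d in data if d[1] == k] for k in (0, 1)]
    small, large = sorted(groups, key=len)
    if method == "ROS":    # duplicate randomly chosen minority objects
        small = small + rng.choices(small, k=len(large) - len(small))
    elif method == "RUS":  # keep a random subset of the majority objects
        large = rng.sample(large, len(small))
    return small + large

def fit_lda(train):
    """Two-group LDA with pooled covariance and sample-proportion priors."""
    g = {k: [x for x, y in train if y == k] for k in (0, 1)}
    m = {k: (sum(x[0] for x in g[k]) / len(g[k]),
             sum(x[1] for x in g[k]) / len(g[k])) for k in (0, 1)}
    sxx = sxy = syy = 0.0
    for k in (0, 1):
        for x in g[k]:
            dx, dy = x[0] - m[k][0], x[1] - m[k][1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(train) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    d0, d1 = m[1][0] - m[0][0], m[1][1] - m[0][1]
    # w = S^{-1} (mean_1 - mean_0), using the closed-form 2x2 inverse
    w = ((syy * d0 - sxy * d1) / det, (sxx * d1 - sxy * d0) / det)
    c = (-0.5 * (w[0] * (m[0][0] + m[1][0]) + w[1] * (m[0][1] + m[1][1]))
         + math.log(len(g[1]) / len(g[0])))
    return lambda x: 1 if w[0] * x[0] + w[1] * x[1] + c > 0 else 0

def kfold_rates(data, k, method, rng):
    """TPR and TNR pooled over k folds; only the training part is resampled."""
    data = data[:]
    rng.shuffle(data)
    tp = fn = tn = fp = 0
    for i in range(k):
        test = data[i::k]
        train = [d for j, d in enumerate(data) if j % k != i]
        predict = fit_lda(resample(train, method, rng))
        for x, y in test:
            p = predict(x)
            if y == 1:
                tp += p
                fn += 1 - p
            else:
                tn += 1 - p
                fp += p
    return tp / (tp + fn), tn / (tn + fp)

data = simulate(200, 20, random.Random(0))
for method in (None, "ROS", "RUS"):
    tpr, tnr = kfold_rates(data, 5, method, random.Random(1))
    print(method, round(tpr, 3), round(tnr, 3))
```

Resampling is applied inside the loop, after the split, so that duplicated minority objects never appear in both the training and the test part of a fold. Repeating kfold_rates with different seeds would give a repeated k-fold estimate, and setting k to the number of objects would give leave-one-out; the bootstrap strategies are not covered by this sketch.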

FINDINGS

[Figure: results compared across the five validation strategies (B, B632, loocv, kfcv, rkfcv)]

DISCUSSION AND CONCLUSION