Abstract

Class-imbalanced datasets are frequently encountered in a variety of areas, including health, security, and finance. These datasets often bias the supervised learning models trained for the prediction task. One of the most successful techniques for handling imbalanced data is undersampling, and experiments demonstrate that cluster-based undersampling improves over random undersampling in many cases. In this paper, we propose three new boosting approaches to improve the performance of cluster-based undersampling techniques: (i) inject unlabeled data into the training data for improved clustering; (ii) keep the instances close to the cluster boundary and the centroid while undersampling; and (iii) remove the majority samples in the neighborhood of minority data in each cluster. We evaluated our boosting methods on 49 standard benchmark datasets and analyzed performance in terms of standard evaluation metrics. Experimental results suggest that these boosting techniques are promising and significantly improve over cluster-based undersampling strategies.
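To make the idea concrete, the sketch below illustrates cluster-based undersampling in the spirit of approach (ii): the majority class is clustered, and within each cluster the instances closest to the centroid and farthest from it (a proxy for the cluster boundary) are retained. The cluster count, keep ratio, and synthetic dataset are illustrative assumptions, not values or code from the paper.

```python
# Illustrative sketch (assumptions: k-means clustering, a fixed keep ratio,
# and a synthetic dataset; not the authors' implementation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def cluster_undersample(X_maj, n_clusters=5, keep_ratio=0.3, rng=0):
    """Return indices of a subset of the majority class.

    For each cluster, keep the instances nearest the centroid and the
    instances farthest from it (near the cluster boundary), splitting
    the per-cluster budget between the two groups.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=rng).fit(X_maj)
    kept = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        if idx.size == 0:
            continue
        dist = np.linalg.norm(X_maj[idx] - km.cluster_centers_[c], axis=1)
        budget = max(1, int(keep_ratio * idx.size))
        order = np.argsort(dist)                     # near-centroid first
        near = idx[order[: (budget + 1) // 2]]       # instances near the centroid
        far = idx[order[::-1][: budget // 2]]        # instances near the boundary
        kept.extend(np.unique(np.concatenate([near, far])))
    return np.array(sorted(set(kept)))

# Toy imbalanced dataset (roughly 90% majority / 10% minority).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
maj = y == 0
keep = cluster_undersample(X[maj])
X_bal = np.vstack([X[maj][keep], X[~maj]])
y_bal = np.hstack([np.zeros(len(keep)), np.ones((~maj).sum())])
print(X_bal.shape, np.bincount(y_bal.astype(int)))
```

Approach (iii) could be sketched analogously by discarding, within each cluster, those majority instances whose nearest neighbors include minority samples; that variant is omitted here for brevity.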
