Abstract

Class-imbalanced datasets are frequently encountered in a variety of areas, including health, security, and finance. These datasets often bias the supervised learning models trained for the prediction task. One of the most successful techniques for handling imbalanced data is undersampling, and experiments demonstrate that cluster-based undersampling improves over random undersampling in many cases. In this paper, we propose three new boosting approaches to improve the performance of cluster-based undersampling techniques: (i) inject unlabeled data into the training data for improved clustering; (ii) keep the instances close to the cluster boundary and the centroid while undersampling; and (iii) remove the majority samples in the neighborhood of minority data in each cluster. We evaluated our boosting methods on 49 standard benchmark datasets and analyzed performance in terms of standard evaluation metrics. Experimental results suggest that these boosting techniques are promising and significantly improve over cluster-based undersampling strategies.
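To make the idea concrete, the sketch below illustrates cluster-based undersampling in the spirit of approach (ii): the majority class is clustered, and within each cluster the instances closest to the centroid and farthest from it (a proxy for the cluster boundary) are retained. The cluster count, keep ratio, and synthetic dataset are illustrative assumptions, not values or code from the paper.

```python
# Illustrative sketch (assumptions: k-means clustering, a fixed keep ratio,
# and a synthetic dataset; not the authors' implementation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def cluster_undersample(X_maj, n_clusters=5, keep_ratio=0.3, rng=0):
    """Return indices of a subset of the majority class.

    For each cluster, keep the instances nearest the centroid and the
    instances farthest from it (near the cluster boundary), splitting
    the per-cluster budget between the two groups.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=rng).fit(X_maj)
    kept = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        if idx.size == 0:
            continue
        dist = np.linalg.norm(X_maj[idx] - km.cluster_centers_[c], axis=1)
        budget = max(1, int(keep_ratio * idx.size))
        order = np.argsort(dist)                     # near-centroid first
        near = idx[order[: (budget + 1) // 2]]       # instances near the centroid
        far = idx[order[::-1][: budget // 2]]        # instances near the boundary
        kept.extend(np.unique(np.concatenate([near, far])))
    return np.array(sorted(set(kept)))

# Toy imbalanced dataset (roughly 90% majority / 10% minority).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
maj = y == 0
keep = cluster_undersample(X[maj])
X_bal = np.vstack([X[maj][keep], X[~maj]])
y_bal = np.hstack([np.zeros(len(keep)), np.ones((~maj).sum())])
print(X_bal.shape, np.bincount(y_bal.astype(int)))
```

Approach (iii) could be sketched analogously by discarding, within each cluster, those majority instances whose nearest neighbors include minority samples; that variant is omitted here for brevity.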
