Abstract

The class imbalance problem is a major issue in machine learning, producing biased classifiers that perform well on the majority class but relatively poorly on the minority class. Handling class imbalance is therefore essential for building accurate prediction models. In this paper, the class imbalance problem is handled using focused undersampling techniques, viz. Cluster-Based undersampling, Tomek Links and Condensed Nearest Neighbours, which equalize the number of instances of the two classes by undersampling the majority class according to specific criteria. This contrasts with random undersampling, where samples are removed from the majority class at random, risking underfitting and the loss of important data points. To fairly compare and evaluate the focused undersampling approaches, prediction models are built with popular machine learning classifiers: K-Nearest Neighbour, Decision Tree and Naive Bayes. The results show that the Decision Tree outperformed the other classifiers, and among the undersampling approaches applied to the Decision Tree, Condensed Nearest Neighbours performed best.
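One of the focused techniques named above can be illustrated concretely. The following is a minimal, self-contained sketch of Tomek-link undersampling, not the paper's exact implementation: two samples form a Tomek link when they are each other's nearest neighbours yet carry different class labels, and the majority-class member of each link is removed. The function name and the brute-force nearest-neighbour search are illustrative choices.

```python
def tomek_link_undersample(X, y, majority_label):
    """Remove majority-class members of Tomek links (illustrative sketch).

    X: list of feature tuples, y: list of class labels,
    majority_label: label of the over-represented class.
    """
    def nearest(i):
        # Index of the nearest other point by squared Euclidean distance.
        best, best_d = None, float("inf")
        for j, p in enumerate(X):
            if j == i:
                continue
            d = sum((a - b) ** 2 for a, b in zip(X[i], p))
            if d < best_d:
                best, best_d = j, d
        return best

    drop = set()
    for i in range(len(X)):
        j = nearest(i)
        # Mutual nearest neighbours with different labels form a Tomek link;
        # drop whichever member belongs to the majority class.
        if j is not None and nearest(j) == i and y[i] != y[j]:
            if y[i] == majority_label:
                drop.add(i)
            if y[j] == majority_label:
                drop.add(j)

    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]
```

For example, with majority-class points at (0, 0), (10, 0), (11, 0) and a single minority point at (1, 0), the pair (0, 0) and (1, 0) are mutual nearest neighbours with different labels, so the majority member (0, 0) is removed while the well-separated majority cluster is kept.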
