Sampling technique for noisy and borderline examples problem in imbalanced classification

Abhishek Dixit, Ashish Mani

doi:10.1016/j.asoc.2023.110361

Abstract

Class imbalance learning (CIL) is an important branch of machine learning. An imbalanced dataset degrades the performance of classifiers. Various undersampling and oversampling approaches are applied to the dataset to address the class imbalance problem. The most successful of the existing solutions is the Synthetic Minority Oversampling Technique (SMOTE), which is used in a broad range of real-world applications. Noise and borderline examples are the two factors that degrade the performance of SMOTE and its variants. To address these issues, filtering-based methods have been developed. However, filtering methods have drawbacks. First, their error detection approaches are highly dependent on parameter settings. Second, samples identified during error detection are deleted or filtered from the sampling process, which distorts the decision boundary and can reintroduce the class imbalance problem. To fix these problems with state-of-the-art filtering-based approaches, this paper proposes a novel filter-based oversampling method, SMOTE-TLNN-DEPSO. In this hybrid variant, SMOTE first generates synthetic samples to augment the original class-imbalanced data. Next, the two-layer natural neighbors technique is used for error detection, identifying the noisy and borderline examples. Lastly, instead of deleting the identified noisy and borderline examples, DEPSO, a hybrid variant of the differential evolution (DE) algorithm based on particle swarm optimization (PSO), is applied to iteratively optimize and modify their positions (attributes).
SMOTE-TLNN-DEPSO has several advantages over other state-of-the-art SMOTE-based filtering approaches. It solves the noise problem. Its error detection technique, which uses the nearest neighbor, is parameter-free. The noisy samples identified by error detection are optimized with DEPSO instead of being removed, which helps maintain the imbalance ratio and improve the boundary. The approach is well suited to data sets with many noisy attributes, especially class attributes. The efficiency and usefulness of the proposed SMOTE-TLNN-DEPSO are demonstrated by exhaustive comparison experiments on artificial and real data sets.
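
For readers unfamiliar with the first stage of the pipeline, the following is a minimal sketch of generic SMOTE-style interpolation, not the authors' implementation. The function name `smote_sketch`, its parameters (`n_new`, `k`, `seed`), and the brute-force neighbor search are assumptions made for illustration. The error detection (two-layer natural neighbors) and DEPSO stages are not shown, since the abstract does not specify them in enough detail.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    Hypothetical helper: X_min holds the minority-class samples (at least two),
    k is the number of nearest minority neighbors considered per sample.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances among minority samples (brute force).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbors

    # Pick a random base sample and one of its k neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because each synthetic point lies on a segment between two existing minority samples, a noisy or borderline minority sample can seed further synthetic points near the majority class, which is the failure mode the abstract attributes to SMOTE and its variants.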

Full Text