Abstract

Aiming at low classification accuracy of imbalanced datasets, an oversampling algorithm—AGNES-SMOTE (Agglomerative Nesting-Synthetic Minority Oversampling Technique) based on hierarchical clustering and improved SMOTE—is proposed. Its key procedures include hierarchically cluster majority samples and minority samples, respectively; divide minority subclusters on the basis of the obtained majority subclusters; select “seed sample” based on the sampling weight and probability distribution of minority subcluster; and restrict the generation of new samples in a certain area by centroid method in the sampling process. The combination of AGNES-SMOTE and SVM (Support Vector Machine) is presented to deal with imbalanced datasets classification. Experiments on UCI datasets are conducted to compare the performance of different algorithms mentioned in the literature. Experimental results indicate AGNES-SMOTE excels in synthesizing new samples and improves SVM classification performance on imbalanced datasets.

Highlights

  • Imbalanced dataset is featured with having fewer instances of some classes than others in a dataset

  • The existing oversampling algorithms mainly deal with between-class imbalance and neglect within-class imbalance

  • Some problems are ignored, such as samples being oversampled are not selected, noise is not removed, synthetic samples will overlap, and samples will be distributed “marginally.” To solve the abovementioned problems, an oversampling algorithm—AGNES-SMOTE—is presented in this paper, which is based on the hierarchical clustering and improved SMOTE. is algorithm follows the following procedures: filter noise samples in the dataset; cluster majority samples and minority samples through the AGNES algorithm, respectively; divide minority subclusters in the light of the obtained majority subclusters; select samples for oversampling based on sampling weight and the probability distribution of minority subclusters; restrict the generation of new samples in a certain area by the centroid method

Read more

Summary

Introduction

Imbalanced dataset is featured with having fewer instances of some classes than others in a dataset. Compared with Cluster-SMOTE, K-means-SMOTE clustered the entire datasets, found the overlap and avoided oversampling in unsafe areas, restricted the synthetic samples in the target area, and eliminated within-class and between-class imbalances It avoided noise samples and attained good results. Its procedures are listed as follows: filter noise samples, adopt the AGNES algorithm to cluster minority samples, form the minority subclusters iteratively, and consider the majority samples distribution during the merging process to avoid generating overlapping synthetic samples. Repeat this operation until the distance between the two closest minority subclusters exceeds the set valve-value. AGNES-SMOTE attains a better result in the experiment

Preliminary Theory
Improved SMOTE Algorithm
Experimental Design and Result Analysis
Experimental Analysis
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.