Abstract

Learning from imbalanced data is a ubiquitous challenge in data mining and machine learning. Noise, which is both common and unavoidable, can make the resulting performance degradation considerably worse. The synthetic minority oversampling technique (SMOTE) and its variants have been proposed to address this problem. These variants either emphasize specific regions of the sample space or combine SMOTE with different noise filters, but they introduce additional parameters that are difficult to optimize or depend on a particular noise filter. Furthermore, SMOTE-based methods select nearest-neighbor samples at random and interpolate at random to synthesize new samples, without considering how chaotic the sample space is. In this study, a framework called SW is proposed, which performs weighted sampling based on a measure of the chaos of the sample space. It is a general, robust, and adaptive framework that copes with noisy imbalanced datasets and can be combined with various oversampling algorithms to improve their performance. In the SW framework, the complete random forest (CRF) is introduced to divide the sample space and adaptively assign weights that distinguish and filter noisy and outlier samples. When synthesizing a new sample, the SW framework selects the neighbors of the seed sample and uses the derived weights to calculate an informed position, bringing the new sample closer to the safe area. Experimental results on 16 benchmark datasets and eight classic classifiers with eight pairs of representative oversampling algorithms demonstrate the effectiveness of the SW framework. The improvement is significant in high-noise situations. In particular, SW-kmeans-SMOTE improved by approximately 5% on average across all metrics. Code and framework are available at https://github.com/dream-lm/SW_framework.
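
The contrast between random interpolation and a weight-informed position can be illustrated with a minimal sketch. The abstract does not state the SW formula, so everything below is an assumption for illustration only: the function names are hypothetical, each sample is assumed to carry a non-negative weight where a larger value means a safer (less noisy) sample, and the weights are supplied by hand rather than derived from a complete random forest. The linked repository is the place to check the actual method.

```python
import random


def smote_interpolate(seed, neighbor, rng):
    """Classic SMOTE step: a uniformly random point on the segment seed -> neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(seed, neighbor)]


def weighted_interpolate(seed, neighbor, w_seed, w_neighbor):
    """Hypothetical weighted step (an assumption, not the paper's formula).

    The new point is placed nearer to whichever endpoint has the larger
    weight, so a low-weight (possibly noisy) neighbor pulls it less.
    """
    total = w_seed + w_neighbor
    # Fall back to the midpoint when neither sample carries any weight.
    gap = 0.5 if total == 0 else w_neighbor / total
    return [s + gap * (n - s) for s, n in zip(seed, neighbor)]


seed = [0.0, 0.0]
neighbor = [1.0, 2.0]

rng = random.Random(0)  # fixed seed so the example is reproducible
random_point = smote_interpolate(seed, neighbor, rng)

# Assumed weights: the seed looks safe (0.9), the neighbor looks noisy (0.1),
# so the synthetic point stays close to the seed.
weighted_point = weighted_interpolate(seed, neighbor, w_seed=0.9, w_neighbor=0.1)
```

With these assumed weights the synthetic point lands one tenth of the way from the seed to the neighbor, whereas the random variant can land anywhere on the segment, including right next to a noisy neighbor.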
