Abstract

Imbalanced data and label noise are ubiquitous challenges in data mining and machine learning that severely impair classification performance. The synthetic minority oversampling technique (SMOTE) and its many variants have been proposed to address class imbalance, but they are constrained by hyperparameter choices such as the number of nearest neighbors k, their performance deteriorates in the presence of noise, they rarely consider data distribution information, and they incur high complexity. Furthermore, SMOTE-based methods perform random linear interpolation between each minority-class sample and its randomly selected k-nearest neighbors, regardless of sample differences and distribution information. To address these problems, an adaptive, robust, and general weighted oversampling framework based on relative neighborhood density (WRND) is proposed. It can be easily combined with most SMOTE-based sampling algorithms to improve their performance. First, the framework adaptively identifies and filters noisy and outlier samples by introducing the natural neighbor, which inherently avoids the extra noise and class overlap that synthesizing from noisy samples would introduce. The relative neighborhood density of each sample is then computed, reflecting the intra-class and inter-class distribution within its natural neighborhood. To alleviate the blindness of SMOTE-based methods, the number and locations of synthetic samples are assigned in an informed manner based on this distribution information and a reasonable generalization of the natural neighborhoods of the original samples. Extensive experiments on 23 benchmark datasets with six classic classifiers, eight pairs of representative sampling algorithms, and two state-of-the-art frameworks demonstrate the effectiveness of the WRND framework. Code and framework are available at https://github.com/dream-lm/WRND_framework.
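
To illustrate the general idea of density-weighted oversampling sketched in the abstract, the following minimal Python example approximates it with off-the-shelf tools. It is not the authors' WRND implementation (available at the repository above): the parameter-free natural-neighbor search is replaced by ordinary k-nearest neighbors, and the density definition, noise filter, and inverse-density weighting below are simplifying assumptions made only for demonstration.

```python
# Illustrative sketch of density-weighted minority oversampling.
# NOT the authors' WRND code; k-NN stands in for natural-neighbor search,
# and the density/weighting choices are assumptions for demonstration.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def weighted_oversample(X, y, minority_label, k=5, random_state=0):
    """Return (X, y) augmented with density-weighted synthetic minority samples."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    minority_mask = y == minority_label
    X_min = X[minority_mask]
    n_needed = int((~minority_mask).sum() - minority_mask.sum())
    if n_needed <= 0 or len(X_min) < 2:
        return X, y

    # Neighborhood of each minority sample over the whole training set
    # (stand-in for WRND's parameter-free natural-neighbor search).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    idx = idx[:, 1:]  # drop each sample itself

    # Assumed "relative neighborhood density": fraction of same-class neighbors.
    same_class = y[idx] == minority_label
    density = same_class.mean(axis=1)

    # Treat minority samples surrounded only by majority neighbors as noise.
    keep = density > 0.0
    if not keep.any():
        return X, y
    X_min, idx, same_class, density = X_min[keep], idx[keep], same_class[keep], density[keep]

    # Give sparser (lower-density) minority regions more synthetic samples;
    # this inverse-density weighting is an assumption, not the WRND formula.
    weights = (1.0 - density) + 1e-6
    counts = rng.multinomial(n_needed, weights / weights.sum())

    synthetic = []
    for i, c in enumerate(counts):
        # Interpolate only toward same-class neighbors to limit class overlap.
        candidates = idx[i][same_class[i]]
        for _ in range(int(c)):
            j = rng.choice(candidates)
            synthetic.append(X_min[i] + rng.random() * (X[j] - X_min[i]))

    X_new = np.vstack([X, np.array(synthetic)])
    y_new = np.concatenate([y, np.full(len(synthetic), minority_label)])
    return X_new, y_new
```

In this sketch, minority samples whose neighborhood contains no same-class members are excluded from synthesis, and sparser minority regions receive proportionally more synthetic samples, mirroring (in simplified form) the informed assignment described in the abstract. A call such as `weighted_oversample(X_train, y_train, minority_label=1)` returns a rebalanced training set that downstream classifiers can consume directly.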
