Abstract

One of the challenging issues in user-click data from online advertising is the uneven class distribution, which biases classification models. Resampling the data is a popular way to obtain class balance. However, oversampling can lead to overfitting, whilst under-sampling can lead to information loss. Moreover, samples from different classes often overlap near the decision boundary, and improving their separability requires careful pruning of instances in data space. Therefore, in this work, a new hybrid data sampling algorithm, SMOTEOSS, is designed and evaluated. It applies the synthetic minority oversampling technique (SMOTE) followed by one-sided selection (OSS) to balance the class distribution. SMOTEOSS works in two stages. First, it oversamples the under-represented class using SMOTE by generating synthetic instances. However, synthetic instances generated close to the decision boundary directly influence the learning model's decision-making. Second, using OSS, the proposed method identifies Tomek links and eliminates noisy majority instances as well as redundant instances. The proposed method's effectiveness is validated on the FDMA 2012 dataset against 10 state-of-the-art sampling methods using the gradient tree boosting learning model. To further validate SMOTEOSS, a fair comparison is made by conducting experiments on 10 other benchmark imbalanced datasets using 10-fold cross-validation. Performance is measured using average precision, recall, F1-score, G-mean, the area under the curve (AUC) and reduction rate. The results show that the designed hybrid methodology is an efficient alternative to existing sampling methods. The Wilcoxon signed-rank test is employed to demonstrate significant differences between the proposed and conventional sampling algorithms.
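
The two stages described above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names (`smote`, `one_sided_selection`, `smote_oss`), the neighbour count `k=3`, the toy Gaussian data, and the choice to run the condensing step before Tomek-link removal are all assumptions made for illustration. It also assumes at least two minority points and uses brute-force nearest-neighbour search.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def nearest_index(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_majority_indices(X, y, majority_label):
    """Majority points in a Tomek link: a pair of opposite-class points
    that are each other's nearest neighbour."""
    drop = set()
    for i in range(len(X)):
        j = nearest_index(i, X)
        if y[i] != y[j] and nearest_index(j, X) == i:
            drop.update(t for t in (i, j) if y[t] == majority_label)
    return drop


def one_sided_selection(X, y, majority_label, seed=0):
    """Simplified OSS: drop redundant majority points with a 1-NN
    condensing pass, then drop majority points in Tomek links."""
    rng = random.Random(seed)
    maj = [i for i, lab in enumerate(y) if lab == majority_label]
    seed_idx = rng.choice(maj)
    store = [i for i, lab in enumerate(y) if lab != majority_label] + [seed_idx]
    # Keep only majority points that the initial store misclassifies;
    # correctly classified ones are treated as redundant.
    missed = [i for i in maj if i != seed_idx
              and y[min(store, key=lambda j: math.dist(X[i], X[j]))] != y[i]]
    keep = store + missed
    Xc, yc = [X[i] for i in keep], [y[i] for i in keep]
    drop = tomek_majority_indices(Xc, yc, majority_label)
    X_out = [x for t, x in enumerate(Xc) if t not in drop]
    y_out = [lab for t, lab in enumerate(yc) if t not in drop]
    return X_out, y_out


def smote_oss(X, y, minority_label, majority_label, seed=0):
    """Oversample the minority class to parity, then prune the majority."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = sum(lab == majority_label for lab in y) - len(minority)
    synth = smote(minority, n_new, seed=seed)
    X_bal = list(X) + synth
    y_bal = list(y) + [minority_label] * len(synth)
    return one_sided_selection(X_bal, y_bal, majority_label, seed=seed)


# Toy imbalanced data (hypothetical): 40 majority vs 8 minority points in 2-D.
data_rng = random.Random(42)
majority_pts = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
minority_pts = [(data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)) for _ in range(8)]
X = majority_pts + minority_pts
y = [0] * 40 + [1] * 8

X_res, y_res = smote_oss(X, y, minority_label=1, majority_label=0)
print("before:", y.count(0), "majority /", y.count(1), "minority")
print("after: ", y_res.count(0), "majority /", y_res.count(1), "minority")
```

In this sketch only majority instances are ever removed, so the minority class keeps every original and synthetic point. For real experiments, a maintained library such as imbalanced-learn, which provides `SMOTE` and `OneSidedSelection` classes, would be a more practical starting point than this brute-force version.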

