Abstract

Imbalanced class distribution is a frequent and problematic issue in data engineering and machine learning. In this situation, traditional classification algorithms and machine learning models often fail to classify the minority class accurately. To improve classification performance on the minority class, prior research has recommended generating artificial minority class instances through oversampling. However, because these popular oversampling techniques create new minority class instances in the original data space, the classification improvements they provide may not carry over to the support vector machine (SVM) model. In this paper, we develop a novel oversampling method, termed DB-MTD-SN (distance-based mega-trend-diffusion Siamese network), that generates artificial minority class instances to increase the SVM model's classification accuracy on imbalanced datasets. In the proposed method, we use the distance-based mega-trend-diffusion (DB-MTD) technique to estimate the data domain of the few available support vectors and generate synthetic examples that follow the multi-modal distribution of the minority class. We further construct a novel membership-function-based Siamese network (MF-based SN) to identify the most representative synthetic minority class examples. The proposed SN model maps the original data onto a high-dimensional space so that complicated patterns between the majority and minority classes can be learned more easily. Within the SN model, an MF-based contrastive loss function measures the similarity between each new synthetic example and the original examples to avoid generating noise. To demonstrate the efficacy of the proposed DB-MTD-SN approach, ten benchmark datasets are used in this study. On two types of SVM models, we compare the proposed method with three state-of-the-art oversampling methods. Three evaluation metrics, G-mean, F1, and the index of balanced accuracy (IBA), are used to measure SVM classification performance on the imbalanced datasets. The experimental datasets are prepared with imbalance ratios (IRs) of 15, 20, and 30 to test the classification performance of the five methods. At the highest IR of 30, the proposed method achieved the best averages on the two SVM models in terms of G-mean (0.865 and 0.865), F1 (0.785 and 0.772), and IBA (0.681 and 0.689), respectively. The paired Wilcoxon test is used to evaluate whether the proposed approach differs significantly from the four other methods on the three evaluation metrics. The test results demonstrate that the classification results of the proposed DB-MTD-SN method show statistically significant improvements (p-value < 0.05) in G-mean, F1, and IBA compared with the other four methods. Our experimental results indicate that the proposed DB-MTD-SN method outperforms the other oversampling methods on imbalanced datasets.

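The abstract reports results in terms of G-mean, F1, and the index of balanced accuracy (IBA). As a quick reference, the sketch below shows how these three metrics can be computed from a binary confusion matrix. It is not the authors' code; the convention that the minority class is the positive class and the IBA weighting factor alpha = 0.1 are assumptions based on common practice and may differ from the paper's setup.

    # Illustrative sketch (not the authors' implementation): G-mean, F1, and IBA
    # from a binary confusion matrix. Assumes the minority class is labeled 1.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    def imbalance_metrics(y_true, y_pred, alpha=0.1):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        tpr = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (minority recall)
        tnr = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (majority recall)
        precision = tp / (tp + fp) if (tp + fp) else 0.0

        g_mean = np.sqrt(tpr * tnr)
        f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
        # IBA weights the squared G-mean by the dominance (TPR - TNR) so that
        # errors on the minority class are penalized more heavily; alpha = 0.1
        # is the commonly used default, assumed here.
        iba = (1 + alpha * (tpr - tnr)) * g_mean ** 2
        return {"G-mean": g_mean, "F1": f1, "IBA": iba}

    # Example usage on a toy imbalanced test set (30 majority vs. 2 minority).
    y_true = [0] * 30 + [1] * 2
    y_pred = [0] * 28 + [1] * 2 + [1, 0]
    print(imbalance_metrics(y_true, y_pred))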