Abstract

Data imbalance is a common phenomenon in machine learning. In imbalanced data classification, minority-class samples are far outnumbered by majority-class samples, which makes it difficult for classifiers to learn the minority class effectively. The synthetic minority oversampling technique (SMOTE) improves the sensitivity of classifiers to the minority class by synthesizing new, non-duplicated minority samples. However, the way SMOTE synthesizes new samples can introduce problems such as "noisy samples" and "boundary samples." To address these problems, we propose a synthetic minority oversampling technique based on Gaussian mixture model filtering (GMF-SMOTE). GMF-SMOTE first groups the imbalanced data with the expectation-maximization (EM) algorithm of a Gaussian mixture model. An EM-based filtering step then removes the "noisy samples" and "boundary samples" from the resulting subclasses. Finally, we design two dynamic oversampling ratios to synthesize majority and minority samples. Experimental results on 20 UCI datasets show that GMF-SMOTE outperforms traditional oversampling algorithms. On the datasets resampled by GMF-SMOTE, random forest (RF) achieves average sensitivity and specificity of 97.49% and 97.02%, respectively, and its G-mean and MCC reach 97.32% and 94.80%, respectively, significantly better than the traditional oversampling algorithms. Moreover, two statistical tests confirm that GMF-SMOTE is significantly better than the traditional oversampling algorithms.
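The pipeline sketched in the abstract (EM-based grouping, filtering of noisy and boundary samples, then oversampling) can be approximated with off-the-shelf tools. The following is a minimal sketch, not the authors' GMF-SMOTE: it assumes scikit-learn's GaussianMixture (which is fitted via the EM algorithm) and imbalanced-learn's SMOTE, and a simple per-class log-likelihood quantile cutoff stands in for the paper's EM filtering and dynamic oversampling ratios. The helper gmm_filter_then_smote and its parameters are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from imblearn.over_sampling import SMOTE

def gmm_filter_then_smote(X, y, n_components=2, keep_quantile=0.1, random_state=0):
    """Drop each class's least-likely samples under a per-class GMM, then oversample."""
    X_kept, y_kept = [], []
    for cls in np.unique(y):
        X_cls = X[y == cls]
        # Fit a GMM to this class; scikit-learn fits it with the EM algorithm.
        gmm = GaussianMixture(n_components=n_components,
                              random_state=random_state).fit(X_cls)
        log_lik = gmm.score_samples(X_cls)            # per-sample log-likelihood
        cutoff = np.quantile(log_lik, keep_quantile)  # threshold for the low-likelihood tail
        mask = log_lik >= cutoff                      # pruned points play the role of "noisy"/"boundary" samples
        X_kept.append(X_cls[mask])
        y_kept.append(np.full(mask.sum(), cls))
    X_f, y_f = np.vstack(X_kept), np.concatenate(y_kept)
    # Standard SMOTE balances the filtered data; GMF-SMOTE instead uses
    # two dynamic oversampling ratios, which are not reproduced here.
    return SMOTE(random_state=random_state).fit_resample(X_f, y_f)
```

For example, gmm_filter_then_smote(X, y, n_components=2, keep_quantile=0.1) prunes the 10% least-likely samples in each class before balancing; in the paper's method, the filtering and the oversampling ratios are data-driven rather than fixed.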
