CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests

Li Ma, Suohai Fan

doi:10.1186/s12859-017-1578-z

Abstract

Background

The random forests algorithm is a widely applicable classifier that is robust against overfitting, but it still has some drawbacks. To improve the performance of random forests, this paper addresses imbalanced data processing, feature selection and parameter optimization.

Results

We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data show that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) improves on the classification results obtained on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, a hybrid RF (random forests) algorithm is proposed for feature selection and parameter optimization, with the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms (the hybrid genetic-random forests, hybrid particle swarm-random forests and hybrid fish swarm-random forests algorithms) can achieve the minimum OOB error and show the best generalization ability.

Conclusion

The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. This feasible and effective algorithm therefore produces better classification results. Moreover, the F-value, G-mean, AUC and OOB scores of the hybrid algorithm show that it surpasses the original RF algorithm. Hence, the hybrid algorithm provides a new way to perform feature selection and parameter optimization.
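The oversampling step that SMOTE contributes can be illustrated with a minimal sketch. This is plain SMOTE-style interpolation between minority-class neighbours, not the authors' CURE-SMOTE: the CURE clustering stage is not described in this excerpt and is omitted. The function name `smote_oversample`, its parameters, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Hypothetical helper: plain SMOTE-style interpolation, without the
    CURE clustering stage that the paper adds.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority samples, so noisy or outlying minority samples would propagate into the synthetic set. That is the weakness a clustering stage such as CURE is meant to reduce, according to the abstract.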

Highlights

The random forests algorithm is a widely applicable classifier that is robust against overfitting.
On the SPECT dataset, Clustering Using Representatives (CURE)-synthetic minority oversampling technique (SMOTE) surpasses the other sampling algorithms with regard to F-value, Geometric Mean (G-mean), Area under the receiver operating characteristics (ROC) curve (AUC) and out of bag (OOB) error. The best value of each performance evaluation criterion obtained by the algorithms is marked in boldface.
From the Connectionist Bench results, we find that the artificial fish swarm algorithm (AFSA)-random forests (RF) achieves the minimum OOB error and the maximum margin
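
Two of the criteria named above, F-value and G-mean, can be computed from confusion-matrix counts. The sketch below uses the common textbook definitions (F-value as the harmonic mean of precision and recall, G-mean as the square root of sensitivity times specificity) and treats the minority class as positive. The exact variants used in the paper are not given in this excerpt, so these formulas, the function name `imbalance_metrics`, and the counts are assumptions.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """F-value and G-mean from confusion-matrix counts (minority = positive).

    Assumed textbook definitions; the paper may use a weighted variant.
    """
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"f_value": f_value, "g_mean": g_mean}


# Made-up counts: 10 minority samples, 90 majority samples
scores = imbalance_metrics(tp=8, fn=2, fp=4, tn=86)
```

In this made-up example, plain accuracy would be 0.94 even though two of the ten minority samples are missed. Both F-value and G-mean come out lower than that, which is why they are preferred for imbalanced data.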

Summary

Introduction

The random forests algorithm is a widely applicable classifier that is robust against overfitting. To improve the performance of random forests, this paper addresses imbalanced data processing, feature selection and parameter optimization. In 2001, Breiman [3] proposed random forests, a novel ensemble learning classification method that combines bagging ensemble learning with Tin Kam Ho's concept. For imbalanced training sets, data preprocessing can increase the classification accuracy of RF on minority class samples. A novel hybrid algorithm [14] that integrates a radial basis function neural network (RBFNN) with RF was proposed to improve classification of the minority class in imbalanced datasets. The Mega-Trend-Diffusion (MTD) technique [17] was developed to obtain the best results on breast and colon cancer datasets by increasing the number of minority class samples when building the prediction model.
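
Bagging trains each ensemble member on a bootstrap sample, and the rows left out of that sample are the out of bag (OOB) data, whose error the abstract names as the objective function. The following sketch shows how an OOB error can be estimated. It is not a random forest: a nearest-centroid classifier stands in for the decision trees, and no random feature subsets are drawn. All function names, parameters and data here are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(rows, labels):
    """Stand-in base learner: per-class mean vector (NOT a decision tree)."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def oob_error(rows, labels, n_estimators=25, seed=0):
    """Estimate the OOB misclassification rate of a bagged ensemble."""
    rng = random.Random(seed)
    n = len(rows)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        model = fit_centroids([rows[i] for i in idx],
                              [labels[i] for i in idx])
        for i in set(range(n)) - set(idx):          # out-of-bag rows only
            votes[i][predict(model, rows[i])] += 1
    # Majority vote per row, skipping rows that were never out of bag
    scored = [(v.most_common(1)[0][0], labels[i])
              for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no row was ever out of bag; add estimators")
    return sum(pred != true for pred, true in scored) / len(scored)


# Made-up, well-separated two-class data
rows = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
err = oob_error(rows, labels)
```

Because every row is scored only by models that never saw it, the estimate needs no separate validation set. A search procedure could therefore call such a function repeatedly with different feature subsets or parameter settings and keep the configuration with the lowest value.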

Journal: BMC Bioinformatics	Publication Date: Mar 14, 2017
Citations: 163	License type: open-access
