Abstract

With the rapid development of the Internet, big data has been applied in a large number of applications. However, high-dimensional data often contain redundant or irrelevant features, which makes feature selection particularly important. Because the feature subset obtained by a single feature selection method may be biased, this paper proposes an ensemble feature selection method for classification tasks, named SA-EFS, based on sort aggregation. For high-dimensional data sets, the results of three feature selection methods (chi-square test, maximal information coefficient, and XGBoost) are aggregated by a specific strategy, and the effects of the arithmetic mean and geometric mean aggregation strategies on the model are analyzed. To evaluate the classification and prediction performance of the resulting feature subsets, three well-performing classifiers (KNN, random forest, and XGBoost) are tested, and the influence of the selection threshold on classification performance is analyzed. The experimental results show that, compared with a single feature selection method, arithmetic mean aggregation ensemble feature selection effectively improves classification accuracy, and a threshold interval of 0.1 is a good choice.
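The paper does not reproduce its code here, so the following is a minimal sketch of the sort-aggregation idea under stated assumptions: each method's importance scores are min-max normalized and combined by arithmetic or geometric mean; mutual information stands in for the maximal information coefficient (which would need an external library such as minepy); the data set and the threshold alpha are illustrative placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# Stand-in data set (the paper uses three UCI data sets).
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Per-method importance scores. NOTE: the paper uses the maximal
# information coefficient (MIC); mutual information is substituted
# here to stay within scikit-learn.
chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)
xgb_scores = XGBClassifier(n_estimators=100, random_state=0).fit(X, y).feature_importances_

def minmax(s):
    """Normalize a score vector to [0, 1] so the methods are comparable."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

S = np.vstack([minmax(chi2_scores), minmax(mi_scores), minmax(xgb_scores)])

# The two aggregation strategies compared in the paper.
w_arith = S.mean(axis=0)                        # arithmetic mean
w_geom = np.exp(np.log(S + 1e-12).mean(axis=0)) # geometric mean (smoothed)

# Sort features by aggregated weight and keep the top alpha fraction.
alpha = 0.5  # illustrative threshold
order = np.argsort(w_arith)[::-1]
selected = order[: max(1, int(round(alpha * X.shape[1])))]
print("selected feature indices:", selected)
```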

Highlights

  • With the rapid development of the Internet and information technology, the scale of data that various industries can process has grown continuously, bringing with it problems such as the ‘curse of dimensionality’

  • Because the feature subset obtained by a single feature selection method may be biased, this paper proposes SA-EFS, an ensemble feature selection method based on sort aggregation that is oriented to classification tasks

  • The experimental results show that, compared with a single feature selection method, arithmetic mean aggregation ensemble feature selection effectively improves classification accuracy, and a threshold interval of 0.1 is a good choice



Introduction

With the rapid development of the Internet and information technology, the scale of data that various industries can process has grown continuously, bringing with it problems such as the ‘curse of dimensionality’. Feature selection reduces the feature dimension and improves classification accuracy and efficiency by deleting irrelevant and redundant features from data sets; it also denoises the data and prevents machine learning models from over-fitting (Chandrashekar & Sahin, 2014). Ensemble feature selection collects and aggregates the results learned from multiple optimal feature subsets. Owing to this integration, ensemble feature selection algorithms are more stable and robust than single feature selection algorithms when dealing with high-dimensional data that admit multiple optimal feature subsets. Normalizing and aggregating the feature importances effectively avoids the low prediction performance that arises when one or a few feature selection methods mis-select the feature subset. Experimental results on three UCI machine learning data sets show that the SA-EFS method fully exploits the strengths of the different feature selection methods and obtains higher predicted AUC values with the classification algorithms. A sketch of how the threshold analysis can be run appears below.
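The following is a minimal sketch of the threshold sweep described above, assuming the aggregated ranking `order` produced by the earlier sketch. Classifier hyperparameters are scikit-learn/xgboost defaults rather than the paper's settings, and only cross-validated accuracy is computed here (the paper also reports AUC).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def evaluate_thresholds(X, y, order, step=0.1, cv=5):
    """Cross-validate KNN, random forest, and XGBoost on the top-alpha
    feature subsets for alpha = step, 2*step, ..., 1.0."""
    classifiers = {
        "KNN": KNeighborsClassifier(),
        "RandomForest": RandomForestClassifier(random_state=0),
        "XGBoost": XGBClassifier(random_state=0),
    }
    n_features = X.shape[1]
    results = {}
    for alpha in np.arange(step, 1.0 + 1e-9, step):
        idx = order[: max(1, int(round(alpha * n_features)))]
        for name, clf in classifiers.items():
            acc = cross_val_score(clf, X[:, idx], y, cv=cv).mean()
            results[(round(alpha, 1), name)] = acc
            print(f"alpha={alpha:.1f}  {name:>12s}  accuracy={acc:.4f}")
    return results

# Usage with the ranking from the previous sketch:
# evaluate_thresholds(X, y, order, step=0.1)
```

A step of 0.1 matches the threshold interval the abstract identifies as a good trade-off between subset granularity and evaluation cost.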

Related works
Chi-square test feature selection
Feature screening with the maximal information coefficient
XGBoost feature selection and classification principle
Prediction model evaluation metrics
Overall framework
Experimental data sets
Experimental design
Classification accuracy results
Threshold alpha impact analysis
Conclusion