Abstract

With the rapid development of the Internet, big data has been applied in a large number of applications. However, high-dimensional data often contain redundant or irrelevant features, which makes feature selection particularly important. Because the feature subset obtained by a single feature selection method may be biased, this paper proposes an ensemble feature selection method for classification tasks, named SA-EFS, based on sort aggregation. For high-dimensional data sets, the results of three feature selection methods (chi-square test, maximal information coefficient, and XGBoost) are aggregated by a specific strategy, and the effects of the arithmetic mean and geometric mean aggregation strategies on the model are analyzed. To evaluate the classification and prediction performance of the resulting feature subsets, three well-performing classifiers (KNN, random forest, and XGBoost) are tested, and the influence of the selection threshold on classification performance is analyzed. The experimental results show that, compared with a single feature selection method, arithmetic mean aggregation ensemble feature selection effectively improves classification accuracy, and a threshold interval of 0.1 is a good choice.
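The paper does not reproduce its code here, so the following is a minimal sketch of the sort-aggregation idea under stated assumptions: each method's importance scores are min-max normalized and combined by arithmetic or geometric mean; mutual information stands in for the maximal information coefficient (which would need an external library such as minepy); the data set and the threshold alpha are illustrative placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# Stand-in data set (the paper uses three UCI data sets).
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Per-method importance scores. NOTE: the paper uses the maximal
# information coefficient (MIC); mutual information is substituted
# here to stay within scikit-learn.
chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)
xgb_scores = XGBClassifier(n_estimators=100, random_state=0).fit(X, y).feature_importances_

def minmax(s):
    """Normalize a score vector to [0, 1] so the methods are comparable."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

S = np.vstack([minmax(chi2_scores), minmax(mi_scores), minmax(xgb_scores)])

# The two aggregation strategies compared in the paper.
w_arith = S.mean(axis=0)                        # arithmetic mean
w_geom = np.exp(np.log(S + 1e-12).mean(axis=0)) # geometric mean (smoothed)

# Sort features by aggregated weight and keep the top alpha fraction.
alpha = 0.5  # illustrative threshold
order = np.argsort(w_arith)[::-1]
selected = order[: max(1, int(round(alpha * X.shape[1])))]
print("selected feature indices:", selected)
```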

Highlights

  • With the rapid development of the Internet and information technology, the scale of data that various industries can process has grown continuously, bringing with it problems such as the ‘curse of dimensionality’

  • Because the feature subset obtained by a single feature selection method may be biased, this paper proposes SA-EFS, an ensemble feature selection method based on sort aggregation that is oriented to classification tasks

  • The experimental results show that, compared with a single feature selection method, arithmetic mean aggregation ensemble feature selection effectively improves classification accuracy, and a threshold interval of 0.1 is a good choice



Introduction

With the rapid development of the Internet and information technology, the scale of data that various industries can process has grown continuously, bringing with it problems such as the ‘curse of dimensionality’. Feature selection reduces the feature dimension and improves classification accuracy and efficiency by deleting irrelevant and redundant features from data sets; it also denoises the data and prevents machine learning models from over-fitting (Chandrashekar & Sahin, 2014). Ensemble feature selection collects and aggregates the results learned from multiple optimal feature subsets. Owing to this integration, ensemble feature selection algorithms are more stable and robust than single feature selection algorithms when dealing with high-dimensional data that admit multiple optimal feature subsets. Normalizing and aggregating the feature importances effectively avoids the low prediction performance that arises when one or a few feature selection methods mis-select the feature subset. Experimental results on three UCI machine learning data sets show that the SA-EFS method fully exploits the strengths of the different feature selection methods and obtains higher predicted AUC values with the classification algorithms. A sketch of how the threshold analysis can be run appears below.
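The following is a minimal sketch of the threshold sweep described above, assuming the aggregated ranking `order` produced by the earlier sketch. Classifier hyperparameters are scikit-learn/xgboost defaults rather than the paper's settings, and only cross-validated accuracy is computed here (the paper also reports AUC).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def evaluate_thresholds(X, y, order, step=0.1, cv=5):
    """Cross-validate KNN, random forest, and XGBoost on the top-alpha
    feature subsets for alpha = step, 2*step, ..., 1.0."""
    classifiers = {
        "KNN": KNeighborsClassifier(),
        "RandomForest": RandomForestClassifier(random_state=0),
        "XGBoost": XGBClassifier(random_state=0),
    }
    n_features = X.shape[1]
    results = {}
    for alpha in np.arange(step, 1.0 + 1e-9, step):
        idx = order[: max(1, int(round(alpha * n_features)))]
        for name, clf in classifiers.items():
            acc = cross_val_score(clf, X[:, idx], y, cv=cv).mean()
            results[(round(alpha, 1), name)] = acc
            print(f"alpha={alpha:.1f}  {name:>12s}  accuracy={acc:.4f}")
    return results

# Usage with the ranking from the previous sketch:
# evaluate_thresholds(X, y, order, step=0.1)
```

A step of 0.1 matches the threshold interval the abstract identifies as a good trade-off between subset granularity and evaluation cost.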

Related works
Chi-square test feature selection
Feature screening with the maximal information coefficient
XGBoost feature selection and classification principle
Prediction model evaluation metrics
Overall framework
Experimental data sets
Experimental design
Classification accuracy results
Threshold alpha impact analysis
Conclusion