Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE.

Qi Chen, Xinyi Liu, Ran Su, Zhaopeng Meng, Qianguo Jin

doi:10.3390/genes9060301

Abstract

Feature selection, which identifies a set of the most informative features from the original feature space, has been widely used to simplify the predictor. Recursive feature elimination (RFE), one of the most popular feature selection approaches, is effective in reducing data dimensionality and increasing efficiency. RFE produces a ranking of features, as well as candidate subsets with their corresponding accuracy. The subset with the highest accuracy (HA) or with a preset number of features (PreNum) is often used as the final subset. However, this may lead to a large number of features being selected, and when there is no prior knowledge about the preset number, final subset selection is often ambiguous and subjective. A proper decision variant is in high demand to automatically determine the optimal subset. In this study, we conduct pioneering work to explore the decision variant after obtaining a list of candidate subsets from RFE. We provide a detailed analysis and comparison of several decision variants to automatically select the optimal feature subset. The random forest (RF)-recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two very different molecular biology datasets, one for a toxicogenomic study and the other for protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
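
As a minimal sketch of the two conventional decision rules named above (HA and PreNum), the following Python snippet applies each to a list of candidate subsets. The candidate sizes and accuracies are invented for illustration and are not results from the study; the function names and the tie-breaking rule (prefer fewer features) are likewise assumptions of this sketch, not the authors' implementation.

```python
# Hypothetical RFE output: (number of features, accuracy) per candidate subset.
# These values are made up for illustration only.
candidates = [
    (50, 0.84),
    (40, 0.86),
    (30, 0.86),
    (20, 0.85),
    (10, 0.81),
    (5, 0.74),
]


def select_highest_accuracy(candidates):
    """HA rule: pick the candidate subset with the highest accuracy.

    Ties are broken in favor of fewer features (an assumption of this sketch).
    """
    return max(candidates, key=lambda c: (c[1], -c[0]))


def select_preset_number(candidates, preset_num):
    """PreNum rule: pick the candidate whose size equals a user-supplied number."""
    for size, accuracy in candidates:
        if size == preset_num:
            return (size, accuracy)
    raise ValueError(f"no candidate subset with {preset_num} features")


ha_choice = select_highest_accuracy(candidates)
prenum_choice = select_preset_number(candidates, preset_num=10)
print("HA:", ha_choice)
print("PreNum:", prenum_choice)
```

In this toy example HA keeps 30 features while several smaller subsets are only slightly less accurate, and PreNum depends entirely on the user's choice of 10. These are the two weaknesses that motivate looking for alternative decision variants.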

Highlights

Feature selection is a frequently used technology in the fields of machine learning and statistics, aiming to reduce the high-dimensional feature space by selecting relevant features and removing redundant features
We provide a detailed analysis for automatically selecting the optimal feature subset using various decision variants in random forest (RF)-recursive feature elimination
Although a number of recursive feature elimination (RFE)-based feature selection algorithms have been developed over the years, little is available for selecting the optimal feature subset after obtaining a group of subsets and their corresponding accuracy

Summary

Introduction

Feature selection is a frequently used technique in machine learning and statistics that reduces a high-dimensional feature space by selecting relevant features and removing redundant ones. Over the past few years, driven by applications in a wide range of fields, such as genetics, chemometrics, and business, feature selection has attracted growing attention as one of the most important research areas in high-dimensional data analysis. It can simplify the model and reduce the computational cost to a large extent [1,2]. Using a feature selection technique, Su et al. found that fluorescent marker plays an extremely important role in kidney toxicity [4]. In Saeys et al.’s study, they discussed the applications of feature selection techniques

Methods

Results

Conclusion