Feature selection (FS) is a crucial step in data engineering: it not only saves time and resources during model training but can also improve prediction performance. In many real-world data sets, the feature space consists of binary features related via generalization-specialization relationships (i.e., a hierarchical feature space). These interdependencies among features, which can be described by a directed acyclic graph, pose new challenges for classical FS methods. In this paper, a Hierarchical Feature Selection method based on Correlation and Structural Redundancy (HFS-CSR) is proposed to address this challenge. In HFS-CSR, the correlation between features and the class label is measured with the Fisher Score, whereas the structural redundancy among features is assessed quantitatively via hierarchical redundancy and depth penalties. These two measures are combined adaptively to determine feature importance for filtering. Experiments are conducted on two typical kinds of data sets with hierarchical feature spaces: aging gene data sets and NLP data sets. The results demonstrate that HFS-CSR achieves better performance than seven existing FS methods.
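To make the correlation component concrete, the sketch below computes the standard Fisher Score (between-class scatter of a feature's means divided by its within-class scatter); the paper's exact variant and any normalization are assumptions here, not taken from the abstract.

```python
import numpy as np

def fisher_score(X, y):
    """Standard Fisher Score per feature: sum over classes of
    n_c * (class mean - overall mean)^2, divided by the sum of
    n_c * within-class variance. Higher = stronger correlation
    with the class label. (Illustrative sketch only; HFS-CSR's
    exact formulation may differ.)"""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]          # samples belonging to class c
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        den += n_c * Xc.var(axis=0)
    return num / (den + 1e-12)  # epsilon guards against zero variance

# Example with binary features: feature 0 separates the classes
# perfectly, feature 1 is noise, so feature 0 scores higher.
scores = fisher_score([[0, 1], [0, 0], [1, 1], [1, 0]], [0, 0, 1, 1])
```

In HFS-CSR this relevance score would then be traded off against the hierarchical redundancy and depth penalties, whose definitions are given in the full paper.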