Abstract

Feature selection aims to eliminate redundant or irrelevant variables from input data in order to reduce computational cost, provide a better understanding of the data, and improve prediction accuracy. The majority of existing filter methods rely on a single feature-ranking technique, which may overlook important assumptions about the underlying regression function linking the input variables with the output. In this paper, we propose a novel feature selection framework that combines clustering of variables with multiple feature-ranking techniques to select an optimal feature subset. Because each ranking method makes its own assumption about the regression function, different methods typically select different subsets; we therefore employ multiple feature-ranking methods with disjoint assumptions about the regression function. The proposed approach has a feature-ranking module to identify relevant features and a clustering module to eliminate redundant ones. First, input variables are ranked using the coefficients obtained by training $L_1$-regularized Logistic Regression and Support Vector Machine models, and the feature importances of a Random Forest model. Features ranked below a certain threshold are filtered out. The remaining features are grouped into clusters using an exemplar-based clustering algorithm, which identifies the data points that best exemplify the data and associates every other data point with an exemplar. We use both linear correlation coefficients and information gain to measure the association between a data point and its corresponding exemplar. From each cluster, the highest-ranked feature is selected as a delegate, and the delegates from the three ranked lists are combined into the final feature set by a union operation. Empirical results on a number of real-world data sets confirm the hypothesis that combining features selected by multiple heterogeneous methods yields a more robust feature set and improves prediction accuracy. Compared with the other feature selection approaches evaluated, the features selected by linear correlation-based multi-filter feature selection achieved the best classification accuracy: 98.7%, 100%, 92.3%, and 100% on the Ionosphere, Wisconsin Breast Cancer, Sonar, and Wine data sets, respectively.
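A minimal sketch of this ranking-plus-clustering pipeline follows, assuming scikit-learn estimators for the three rankers, absolute Pearson correlation as the feature-similarity measure, and affinity propagation as the exemplar-based clusterer; the function names, the quantile threshold, and these specific algorithmic choices are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative multi-filter feature selection sketch; the threshold and the
# choice of affinity propagation as the exemplar-based clusterer are assumed.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def rank_features(X, y):
    """One relevance score per feature from each ranking model (binary y assumed)."""
    lr = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
    svm = LinearSVC(penalty="l1", dual=False).fit(X, y)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return [np.abs(lr.coef_).ravel(), np.abs(svm.coef_).ravel(),
            rf.feature_importances_]

def multi_filter_select(X, y, keep_frac=0.5):
    selected = set()
    for scores in rank_features(X, y):
        # Relevance step: drop features ranked below the threshold.
        kept = np.where(scores >= np.quantile(scores, 1.0 - keep_frac))[0]
        # Redundancy step: cluster the surviving features, using absolute
        # linear correlation as the similarity to each exemplar.
        sim = np.abs(np.corrcoef(X[:, kept].T))
        labels = AffinityPropagation(affinity="precomputed",
                                     random_state=0).fit(sim).labels_
        # Delegate step: keep only the highest-ranked feature of each cluster.
        for c in np.unique(labels):
            members = kept[labels == c]
            selected.add(int(members[np.argmax(scores[members])]))
    return sorted(selected)  # union of delegates from the three ranked lists
```

For the information-gain variant described above, the correlation matrix would be replaced by a pairwise mutual-information matrix between features; the rest of the pipeline is unchanged.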

Highlights

  • For a given classification problem, machine learning algorithms use the discriminative abilities of features to categorize observations into different classes, where each feature is an individual characteristic of the process under observation.

  • The performance of a machine learning model depends on model-specific factors and on factors related to the input data, such as the total number of samples and input variables.

  • In this study, we use five data sets, namely Ionosphere, Wisconsin Breast Cancer (WBC), Sonar, Wine, and Vowels, downloaded from the UCI Machine Learning Repository [62]; these data sets have been used widely in machine learning studies and cover a variety of real-world problems (a hypothetical loading sketch follows this list).

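For readers who want to reproduce the setup, these UCI data sets are also mirrored on OpenML. The loader below is a hypothetical sketch: the OpenML names and versions are assumptions that should be verified, and the paper's exact preprocessing is not specified here.

```python
# Hypothetical loader via OpenML mirrors of the UCI benchmark sets; the
# dataset names/versions (e.g. "wdbc" for Wisconsin Breast Cancer) are
# assumed and may differ from the copies used in the paper.
from sklearn.datasets import fetch_openml

for name in ["ionosphere", "wdbc", "sonar", "wine"]:
    X, y = fetch_openml(name, version=1, as_frame=False, return_X_y=True)
    print(name, X.shape)
```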

Introduction

For a given classification problem, machine learning algorithms use the discriminative abilities of features to categorize observations into different classes, where each feature is an individual characteristic of the process under observation. Machine learning data sets have become very large, and in some cases the number of input variables even exceeds the number of samples. The performance of a machine learning model depends on model-specific factors and on factors related to the input data, such as the total number of samples and input variables. In high-dimensional data sets, not all features are important; some may be redundant, irrelevant, or noise. The presence of redundant, irrelevant, or noise variables not only increases computational cost but may also degrade the predictive performance of the learning model. Machine learning models that lack an embedded feature selection mechanism accumulate a small noisy contribution from each noise variable in the predicted output. Models with an embedded dimensionality reduction mechanism, such as deep neural networks, are less affected by such variables.
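To make this effect concrete, the toy simulation below (not from the paper; the data set sizes, noise dimensionality, and choice of model are arbitrary illustrative assumptions) appends pure-noise columns to a synthetic classification problem and compares the cross-validated accuracy of a plain logistic regression, which has no embedded feature selection, with and without the noise.

```python
# Toy illustration of how appended noise variables can degrade a model
# that lacks an embedded feature selection mechanism.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(300, 500))])  # 500 pure-noise columns

clf = LogisticRegression(max_iter=1000)  # L2 penalty: no embedded selection
print("clean accuracy:", cross_val_score(clf, X, y, cv=5).mean())
print("noisy accuracy:", cross_val_score(clf, X_noisy, y, cv=5).mean())
```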
