Abstract

High-dimensional data now plays an important role in many scientific and research applications. Such data consist of a large number of features or attributes, many of which may be redundant or irrelevant, and the resulting curse of dimensionality is a central problem in data mining and machine learning. To reduce the dimensionality of the data and improve classifier performance, these unwanted and redundant features need to be removed. Feature selection techniques identify the redundant and irrelevant features in the original data; feature selection is the process of identifying the most representative features from a collection of features. Existing feature selection algorithms such as FAST reduce dimensionality by removing irrelevant and redundant features and selecting the required subset of features. The FAST algorithm uses information gain to compute symmetric uncertainty, which serves as its measure of correlation between features. However, information gain considers only the presence of a feature with respect to another feature or the class, which limits its accuracy, so the results produced by the FAST algorithm have low accuracy. A new Binary Krill Herd (BKH) algorithm is therefore introduced for feature selection, in which a classifier is used as the objective function. BKH is a bio-inspired algorithm based on the movement of krill individuals. A binary vector determines whether or not a feature is selected, where 1 indicates that a feature is selected and 0 that it is not. The algorithm is evaluated multiple times while varying the number of iterations per run, the population size, the problem dimension, and the number of runs. The Binary Krill Herd algorithm considers not only the presence of a feature but also its change with respect to another feature or the class, which helps it identify strongly correlated features better than the FAST algorithm. The proposed BKH algorithm has been compared against the FAST algorithm on several datasets drawn from the UCI Machine Learning Repository. To validate the new algorithm, classification accuracy is tested with several classifiers, including J48, Naive Bayes, and decision tree classifiers. The experimental results show that the BKH algorithm achieves greater dimensionality reduction by selecting only a small portion of the original features and outperforms the FAST algorithm by producing more accurate results in less time.
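
To make the idea of wrapper-style feature selection with a binary krill population concrete, the sketch below shows one simplified interpretation of the approach described in the abstract. It is not the authors' implementation: the krill motion rules are reduced to attraction toward the best krill plus random diffusion, the sigmoid transfer function, the scikit-learn dataset, the Gaussian Naive Bayes classifier, and all parameter values are illustrative assumptions. It only demonstrates the core mechanism of a binary mask per krill scored by cross-validated classifier accuracy.

```python
# Minimal sketch of binary-vector feature selection in the spirit of a
# Binary Krill Herd (BKH) search. Simplified and hypothetical: the real
# algorithm's induced motion, foraging, and diffusion terms are collapsed
# into a single attraction-plus-noise update.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def fitness(mask, X, y, clf):
    """Objective function: cross-validated accuracy of the classifier
    trained only on the features whose mask bit is 1."""
    if mask.sum() == 0:                      # no features selected
        return 0.0
    return cross_val_score(clf, X[:, mask == 1], y, cv=5).mean()


def binary_krill_herd(X, y, clf, n_krill=20, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]

    # Continuous krill positions; a sigmoid transfer maps them to 0/1 masks.
    pos = rng.normal(size=(n_krill, n_feat))
    masks = (1 / (1 + np.exp(-pos)) > rng.random((n_krill, n_feat))).astype(int)
    scores = np.array([fitness(m, X, y, clf) for m in masks])

    best_idx = scores.argmax()
    best_mask, best_score = masks[best_idx].copy(), scores[best_idx]

    for _ in range(n_iter):
        # Simplified movement: drift toward the best krill + random diffusion.
        attraction = best_mask - masks
        pos += 0.5 * attraction + 0.1 * rng.normal(size=pos.shape)

        # Sigmoid transfer function turns positions back into binary masks.
        masks = (1 / (1 + np.exp(-pos)) > rng.random((n_krill, n_feat))).astype(int)
        scores = np.array([fitness(m, X, y, clf) for m in masks])

        if scores.max() > best_score:
            best_idx = scores.argmax()
            best_mask, best_score = masks[best_idx].copy(), scores[best_idx]

    return best_mask, best_score


if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    mask, acc = binary_krill_herd(X, y, GaussianNB())
    print(f"selected {mask.sum()}/{X.shape[1]} features, CV accuracy = {acc:.3f}")
```

In this sketch each krill's binary mask plays the role described in the abstract (1 = feature selected, 0 = not selected), and the classifier's cross-validated accuracy is the objective function guiding the search.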
