Abstract

Although there have been many attempts to build an optimal model for feature selection in Big Data applications, the complex nature of processing such data keeps it a major challenge. The data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To identify the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase for reducing dataset dimensionality, since an exhaustive search for the relevant features is time-consuming. In this paper, a new binary wrapper feature selection variant combining grey wolf optimization and particle swarm optimization is proposed. A K-nearest neighbor classifier with the Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps the algorithm avoid becoming trapped in local optima. A sigmoid function is employed to convert the continuous search space to a binary one suitable for the feature selection problem, and K-fold cross-validation is used to mitigate overfitting. The proposed model is compared with well-known algorithms, namely standard particle swarm optimization (PSO) and standard grey wolf optimization (GWO). Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. The cumulative number of features selected across the twenty datasets was 196 out of 773, as opposed to 393 and 336 for GWO and PSO, respectively. The overall accuracy is 90%, compared with 81.6% and 86.8% for the other algorithms. The total processing time over all datasets equals 184.3 seconds, whereas GWO and PSO require 272 and 245.6 seconds, respectively.
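The two mechanisms named in the abstract, the tent chaotic map and the sigmoid transfer function, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the function names and the stochastic thresholding rule (select bit *i* when a uniform draw falls below the sigmoid of component *i*) are common conventions in binary swarm optimizers, assumed here for clarity.

```python
import math
import random

def tent_map(x, mu=2.0):
    """One iteration of the tent chaotic map; keeps values in (0, 1)."""
    return mu * x if x < 0.5 else mu * (1.0 - x)

def sigmoid(v):
    """Squash a continuous component into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def binarize(position, rng=random.random):
    """Map a continuous position vector to a binary feature mask:
    bit i is 1 (feature selected) when a uniform draw falls below
    the sigmoid of the i-th component."""
    return [1 if rng() < sigmoid(v) else 0 for v in position]

# Example: a 4-dimensional continuous position becomes a 0/1 mask.
mask = binarize([2.5, -1.0, 0.0, 4.0])
```

In a full optimizer, the tent map would seed or perturb the population to improve diversity, while `binarize` is applied each iteration before the feature mask is scored by the classifier.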

Highlights

  • Artificial Intelligence (AI) techniques gained great attention in many applications due to their ability to extract unexpected information

  • To compare the efficiency of the proposed algorithm with other popular modern feature-selection algorithms, it is evaluated against standard particle swarm optimization (PSO) and standard grey wolf optimization (GWO)

  • The results showed that the proposed model reduces the average number of selected features and improves classification accuracy on all datasets in a reasonable time

Introduction

Artificial Intelligence (AI) techniques have gained great attention in many applications due to their ability to extract unexpected information. Data mining includes various pre-processing steps (integration, sorting, conversion, reduction, etc.) followed by the presentation of knowledge. The performance of clustering and classification methods is significantly affected by growth in the dataset dimensions, since algorithms in both categories operate over those dimensions. The drawbacks of high-dimensional datasets include long model-building times, redundant information, and degraded quality that make data analysis very difficult. To address this problem, feature selection is used as a main pre-processing step to choose a subset of features from the large dataset, improving the precision of classification and clustering models by removing noisy, irrelevant, and ambiguous data [2]. The first step is to apply a search strategy that picks candidate subsets of features.
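In a wrapper approach like the one described here, each candidate feature subset is scored by training and testing a classifier restricted to those features. The sketch below shows one common formulation, assumed for illustration: a 1-nearest-neighbor classifier with Euclidean distance over the masked features, and a fitness that weights classification error against the fraction of features kept (the `alpha` weighting and function names are assumptions, not taken from the paper).

```python
import math

def euclidean(a, b, mask):
    """Euclidean distance computed only over the features enabled in mask."""
    return math.sqrt(sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m))

def knn_predict(train_X, train_y, sample, mask, k=1):
    """Classify `sample` by majority vote among the k nearest training points."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda p: euclidean(p[0], sample, mask))[:k]
    labels = [y for _, y in neighbours]
    return max(set(labels), key=labels.count)

def fitness(mask, train_X, train_y, test_X, test_y, alpha=0.99):
    """Wrapper fitness (lower is better): a weighted sum of the
    classification error rate and the fraction of selected features."""
    if not any(mask):
        return 1.0  # selecting no features at all is the worst solution
    errors = sum(knn_predict(train_X, train_y, x, mask) != y
                 for x, y in zip(test_X, test_y))
    error_rate = errors / len(test_y)
    return alpha * error_rate + (1 - alpha) * (sum(mask) / len(mask))
```

The optimizer then searches over binary masks, calling `fitness` on each candidate; in practice the error term would be estimated with K-fold cross-validation rather than a single train/test split.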
