Abstract

In solving classification problems in machine learning and pattern recognition, the pre-processing of data is particularly important. High-dimensional feature datasets increase the time and space complexity of processing and reduce the accuracy of classification models; a good feature selection method is therefore essential. This paper presents a new feature selection algorithm that retains the selection and mutation operators of traditional genetic algorithms. On the one hand, the global search capability of the algorithm is ensured by varying the population size; on the other, the optimal mutation probability for the feature selection problem is found for different population sizes. During the iterations of the algorithm, the population size does not change, no matter how many transformations are made, and remains equal to the initialized population size; this spatial invariance is, by analogy with physics, referred to as symmetry. The proposed method is compared with other algorithms and validated on different datasets. The experimental results show good performance of the algorithm; in addition, we apply it to a practical Android software classification problem, where the results also show its superiority.

Highlights

  • Data classification is one of the tasks of data mining in the field of machine learning and in the framework of pattern recognition [1]; the quality of the data has a significant impact on the performance of these data mining methods

  • To better understand the advantages of asexual genetic algorithms in feature selection, this paper analyzes their principles by constructing different hypothetical individuals

  • We change the initial population size to increase the diversity of the population, i.e., to improve the global search capability of the algorithm; on the other hand, we experimentally verify the effect of the mutation rate on accuracy while keeping the population size constant


Introduction

Data classification is one of the tasks of data mining in the field of machine learning and in the framework of pattern recognition [1]; the quality of the data has a significant impact on the performance of these data mining methods. Data dimensionality reduction methods include feature extraction (FE), where features are transformed into a smaller dimension, and feature selection (FS) [3], where features are selected from the complete set of features to build a subset without transformation [4]. The method chosen in this paper is feature selection, the aim of which is to identify the most distinctive subset of features in the whole feature set and to provide a suitable recognition rate for a particular classifier [5]. The feature selection problem differs from traditional optimization problems.
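The kind of search described above — choosing a binary subset of features rather than transforming them — can be sketched as a genetic algorithm that, like the one proposed here, keeps only selection and mutation operators and a constant population size. The sketch below is illustrative only: the toy fitness function, feature count, and parameter values are assumptions for demonstration, not the paper's experimental setup.

```python
import random

N_FEATURES = 20
INFORMATIVE = set(range(5))  # assume only the first 5 features are useful


def fitness(individual):
    """Toy fitness: reward informative features, penalize subset size.

    A real implementation would instead score the subset with a
    classifier's accuracy, as the paper does.
    """
    chosen = {i for i, bit in enumerate(individual) if bit}
    return len(chosen & INFORMATIVE) - 0.1 * len(chosen)


def mutate(individual, rate):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ (random.random() < rate) for bit in individual]


def evolve(pop_size=30, generations=100, mutation_rate=0.05, seed=0):
    random.seed(seed)
    population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: binary tournaments produce exactly pop_size parents,
        # so the population size never changes across generations.
        parents = [max(random.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # Mutation only -- no crossover, mirroring an asexual GA.
        population = [mutate(p, mutation_rate) for p in parents]
        assert len(population) == pop_size  # the size invariant ("symmetry")
    return max(population, key=fitness)


best = evolve()
selected = [i for i, bit in enumerate(best) if bit]
```

Varying `pop_size` at initialization changes search diversity, while sweeping `mutation_rate` for each fixed size corresponds to the tuning experiment described in the highlights.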
