An Efficient Cost-Sensitive Feature Selection Using Chaos Genetic Algorithm for Class Imbalance Problem

Jing Bian, Hai Zhang, Xin-Guang Peng, Ying Wang

doi:10.1155/2016/8752181

Abstract

In the era of big data, feature selection is an essential process in machine learning. Although the class imbalance problem has recently attracted a great deal of attention, little effort has been devoted to developing feature selection techniques for it. In addition, most applications involving feature selection focus on classification accuracy rather than cost, although costs are important. To cope with imbalance problems, we developed a cost-sensitive feature selection algorithm, referred to as CSFSG, that adds a cost-based evaluation function to a filter feature selection method using a chaos genetic algorithm. The evaluation function considers both feature-acquiring costs (test costs) and misclassification costs in the field of network security, thereby weakening the influence of the many instances from the majority classes in large-scale datasets. The CSFSG algorithm reduces the total cost of feature selection and trades off both factors. The behavior of the CSFSG algorithm is tested on a large-scale network security dataset using two kinds of classifiers: C4.5 and k-nearest neighbor (KNN). The experimental results show that the approach is efficient, improves classification accuracy, and decreases classification time. In addition, the results of our method are more promising than those of other cost-sensitive feature selection algorithms.
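To make the idea of a cost-based evaluation function concrete, the following is a minimal sketch of how test costs and misclassification costs might be combined into one score for a candidate feature subset. It is not the CSFSG evaluation function itself, whose exact form is not given in this summary. The function name `total_cost`, the additive combination, and all numbers are assumptions chosen for illustration.

```python
import numpy as np

def total_cost(mask, test_costs, confusion, misclass_costs):
    """Average per-instance cost of a feature subset (hypothetical formulation).

    mask           -- 0/1 vector marking which features are selected
    test_costs     -- cost of acquiring each feature
    confusion      -- confusion[i][j] = instances of true class i predicted as j
    misclass_costs -- misclass_costs[i][j] = cost of predicting j when truth is i
    """
    mask = np.asarray(mask, dtype=bool)
    # Test cost: each instance pays for every selected feature.
    test_cost = float(np.sum(np.asarray(test_costs, dtype=float)[mask]))
    # Misclassification cost: cost-weighted errors averaged over all instances.
    confusion = np.asarray(confusion, dtype=float)
    weighted = confusion * np.asarray(misclass_costs, dtype=float)
    misclass_cost = float(weighted.sum() / confusion.sum())
    # Assumption: the two costs share a unit and are simply added.
    return test_cost + misclass_cost

# Illustrative values only: 3 features, 2 classes (majority first, minority second).
test_costs = [1.0, 2.0, 0.5]
confusion = [[90, 10], [5, 5]]
misclass_costs = [[0, 1], [20, 0]]  # missing a minority instance costs 20x more
cost = total_cost([1, 0, 1], test_costs, confusion, misclass_costs)
print(cost)
```

In a genetic algorithm, a score like this could serve as the quantity to minimize, with each chromosome encoding one `mask`. Because minority-class errors carry a larger cost, a subset that ignores the minority class is penalized even when its raw accuracy is high.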

Highlights

The class imbalance problem is found in various scientific and social arenas, such as fraud/intrusion detection, spam detection, risk management, technical diagnostics/monitoring, financial engineering, and medical diagnostics [1, 2, 3, 4].
We focus on cost-sensitive feature selection based on both misclassification costs and test costs.
The algorithm proposed in this paper focuses on the cost-sensitive fitness function rather than on parameter optimization of the genetic algorithm (GA).

Summary

Introduction

The class imbalance problem is found in various scientific and social arenas, such as fraud/intrusion detection, spam detection, risk management, technical diagnostics/monitoring, financial engineering, and medical diagnostics [1, 2, 3, 4]. Researchers have introduced many methods to address these problems, including combining sampling techniques with cost-sensitive learning, setting the cost ratio by inverting prior class distributions, and collecting the cost of features before classification [5, 8, 9]. Before briefly introducing cost-sensitive learning and its application to feature selection, we illustrate the class imbalance problem, which is the topic most relevant to the current research. We propose a new feature selection method whose goal is to provide an efficient approach for the field of network security, an arena in which large imbalanced datasets are typical. Our purpose is not to improve on previous methods but to match the performance of previous cost-sensitive feature selection approaches with a method that handles very large datasets with imbalance problems.
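One of the approaches mentioned above, setting the cost ratio by inverting prior class distributions, can be sketched in a few lines. This is an illustrative reading of that idea rather than a procedure taken from the cited works. The helper name `inverse_prior_costs` and the class labels are hypothetical.

```python
from collections import Counter

def inverse_prior_costs(labels):
    """Assign each class a misclassification cost of 1 / prior(class)."""
    counts = Counter(labels)
    n = len(labels)
    # prior(c) = count(c) / n, so its inverse is n / count(c).
    return {c: n / k for c, k in counts.items()}

# Illustrative imbalanced labels: 95 majority instances, 5 minority instances.
labels = ["normal"] * 95 + ["attack"] * 5
costs = inverse_prior_costs(labels)
ratio = costs["attack"] / costs["normal"]
print(costs, ratio)
```

With these counts, an error on the rare class costs 19 times as much as an error on the common class, which mirrors the 95:5 class ratio.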

Related Work

Cost-Sensitive Feature Selection Model

Results of the Experimental Investigation

Conclusions