Abstract

Feature selection is an important part of a pattern recognition system. A feature selection method must be general enough to find representative features in the training data, which are then used to classify test patterns. Over-selecting is the situation in which the features selected from the training data differ substantially from the representative features of the test data. Its main causes are incomplete consideration of the statistical properties of the training data, heuristic search strategies for feature selection, and a small training sample size. In this paper, we show how over-selecting influences the over-fitting of machine learning algorithms. We propose a new framework that addresses the principal causes of over-selecting and thus reduces the chance of over-fitting. The framework, which we call the Ensemble Feature Selection measure (EnFS), makes it possible to consider many statistical properties of a given data set at the same time by combining several feature selection measures used in the filter model. A new combined measure is constructed from the chosen feature selection measures. We also propose a new search algorithm that guarantees globally optimal feature subsets with respect to the constructed measure. The search is based on solving a mixed 0–1 linear programming (M01LP) problem with the branch-and-bound algorithm. In this M01LP problem, the numbers of constraints and variables are linear in the number of features in the full set. To evaluate the quality of the EnFS measure, we chose the design of an intrusion detection system (IDS) as a possible application. Experimental results on the KDD CUP'99 benchmark data set for IDS show that the EnFS measure is capable of reducing over-fitting by addressing over-selecting.
