Important Features Detection in Continuous Data

Piotr Fulmanski, Alicja Miniak-Górecka

doi:10.14569/ijacsa.2012.031239

Abstract

This paper presents a method for calculating the importance factor of continuous features from a given set of patterns. A real problem in many practical cases, such as medical data, is finding which parts of the patterns are crucial for correct classification. This leads to the need to preprocess all data, which affects both the running time and the accuracy of the applied methods (unimportant data can hide the data that are important). Some methods allow selection of important features for binary and sometimes discrete data or, after some preprocessing, for continuous data. Very often, however, such a conversion carries the risk of losing important information, which results from not knowing the optimal discretization or its consequences. The proposed method avoids that problem because it is based on the original, non-transformed continuous data. Two factors, concentration and diversity, are defined and used to calculate the importance factor for each feature and pattern. Based on those factors, unimportant features can, for example, be identified to reduce the dimension of the input data, or "bad" patterns can be detected to improve classification. An example of how the proposed method can be used to improve a decision tree is also given.
