Abstract

The quality of a fault prediction model depends on the software metrics used to build it. Feature selection is the process of selecting a subset of relevant features that can lead to improved prediction models. Feature selection techniques fall into two broad subcategories: feature ranking and feature-subset selection. In this paper, we present a comparative investigation of seven feature-ranking techniques and eight feature-subset selection techniques for improved fault prediction. The performance of these feature selection techniques is evaluated using two popular machine-learning classifiers, Naive Bayes and Random Forest, over fourteen software projects' fault datasets obtained from the PROMISE data repository. Performance was measured using F-measure and AUC values. Our results demonstrate that feature-ranking techniques produced better results than feature-subset selection techniques. Among the feature-ranking techniques used in the study, InfoGain and PCA performed best across all the datasets, while among the feature-subset selection techniques, ClassifierSubsetEval and Logistic Regression produced better results than their peers.
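To make the distinction between the two families concrete, the sketch below contrasts a feature-ranking technique with a wrapper-style feature-subset selection, then evaluates each with a Naive Bayes classifier using F-measure and AUC. This is a minimal illustration, not the authors' pipeline: it approximates InfoGain with scikit-learn's `mutual_info_classif`, substitutes `SequentialFeatureSelector` wrapped around Logistic Regression for the subset-search techniques studied, and uses synthetic data as a stand-in for the PROMISE fault datasets.

```python
# Hedged sketch (not the authors' code): feature ranking vs. feature-subset
# selection, evaluated with Naive Bayes via F-measure and AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, roc_auc_score

# Synthetic stand-in for a PROMISE fault dataset:
# rows = modules, columns = software metrics, y = 1 if the module is faulty.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

def evaluate(X_tr_sel, X_te_sel):
    """Train Naive Bayes on the selected features; report F-measure and AUC."""
    clf = GaussianNB().fit(X_tr_sel, y_tr)
    f1 = f1_score(y_te, clf.predict(X_te_sel))
    auc = roc_auc_score(y_te, clf.predict_proba(X_te_sel)[:, 1])
    return f1, auc

# Feature ranking: score each metric independently against the fault label
# (mutual information as an InfoGain-style scorer), keep the top 5.
scores = mutual_info_classif(X_tr, y_tr, random_state=0)
top5 = np.argsort(scores)[::-1][:5]
print("ranking (F1, AUC):", evaluate(X_tr[:, top5], X_te[:, top5]))

# Feature-subset selection: greedily search for a 5-feature subset that
# performs well for a wrapped Logistic Regression model.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5).fit(X_tr, y_tr)
print("subset  (F1, AUC):", evaluate(sfs.transform(X_tr), sfs.transform(X_te)))
```

The key design difference the sketch exposes is that ranking scores each metric in isolation and is therefore cheap, while subset selection searches over feature combinations guided by a model, which is costlier but can account for interactions between metrics.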
