Abstract

For most software systems, superfluous software metrics are often collected. Some of the collected metrics may be redundant or irrelevant to fault prediction. Feature (software metric) selection helps separate relevant software metrics from irrelevant or redundant ones, thereby identifying the small set of software metrics that best predict fault proneness for new components, modules, or releases. In this study, we compare three feature selection techniques on four datasets from a real-world software project: filter-based and wrapper-based subset evaluators, each paired with two search techniques (Best First (BF) and Greedy Stepwise (GS)), and feature ranking. Five learners are used to build fault prediction models with the selected software metrics. Each model is assessed using the Area Under the Receiver Operating Characteristic Curve (AUC). We find that the wrapper-based subset evaluators perform best and feature ranking performs worst. In addition, the model built with the logistic regression (LR) learner performs best in terms of the AUC performance metric. This leads us to recommend the wrapper-based subset evaluators for selecting software metric subsets and the LR learner for building software fault prediction models.
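The recommended combination can be sketched in code. The following is a minimal illustration, not the study's actual setup: it uses a synthetic dataset standing in for real software metrics, scikit-learn's `SequentialFeatureSelector` as a stand-in for a wrapper-based subset evaluator with greedy forward search, a logistic regression learner, and AUC as the evaluation metric. All dataset sizes and parameter values here are assumptions for the sake of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for software-metric data: 20 "metrics", of which
# several are informative and several are redundant copies.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000)

# Wrapper-style selection: greedily add the metric that most improves the
# learner's cross-validated AUC (analogous to a Greedy Stepwise search
# wrapped around the learner itself).
selector = SequentialFeatureSelector(lr, n_features_to_select=5,
                                     direction="forward",
                                     scoring="roc_auc", cv=5)
selector.fit(X_train, y_train)

# Build the fault prediction model on the selected subset and assess it
# with AUC on held-out data.
lr.fit(selector.transform(X_train), y_train)
probs = lr.predict_proba(selector.transform(X_test))[:, 1]
auc = roc_auc_score(y_test, probs)
print(f"selected metric indices: {np.flatnonzero(selector.get_support())}")
print(f"held-out AUC: {auc:.3f}")
```

A filter-based evaluator would instead score subsets with a learner-independent criterion (e.g. correlation with the class), while feature ranking scores each metric individually and takes the top-k, which cannot account for redundancy between metrics.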
