A Comparative Study on the Stability of Software Metric Selection Techniques

Huanjing Wang, Amri Napolitano, Taghi M. Khoshgoftaar, Randall Wald

doi:10.1109/icmla.2012.142

Abstract

In large software projects, software quality prediction is an important aspect of the development cycle, helping to focus quality assurance efforts on the modules most likely to contain faults. To perform software quality prediction, various software metrics are collected during the software development cycle, and models are built using these metrics. However, not all features (metrics) make the same contribution to the class attribute (e.g., faulty/not faulty). Thus, selecting a subset of metrics that are relevant to the class attribute is a critical step. As many feature selection algorithms exist, it is important to find ones which will produce consistent results even as the underlying data changes; this quality of producing consistent results is referred to as stability. In this paper, we investigate the stability of seven feature selection techniques in the context of software quality classification. We compare four approaches for varying the underlying data to evaluate stability. The first is the traditional approach of generating many subsamples of the original data and comparing the features selected from each. The second is an earlier approach developed by our research group, which compares the features selected from subsamples of the data with those selected from the original. The remaining two are newly proposed approaches based on generating pairs of subsamples specifically designed to have the same number of instances and a specified level of overlap: one compares the two subsamples within each pair, while the other compares the generated subsamples with the original dataset. The empirical validation is carried out on sixteen software metrics datasets. Our results show that ReliefF is the most stable feature selection technique. Results also show that the level of overlap, degree of perturbation, and feature subset size affect the stability of feature selection methods. Finally, we find that all four approaches to evaluating stability produce similar results in terms of which feature selection techniques are best under different circumstances.
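As an illustration of the pairwise approach described above, the following is a minimal sketch of how two equal-sized subsamples with a specified level of overlap might be drawn, and how two selected feature subsets might be compared. The abstract does not state how the overlap is constructed or which similarity measure is used, so the function names, the use of a shared block of instance indices, and the Jaccard index are all assumptions made for illustration, not the authors' procedure.

```python
import random


def overlapping_subsample_pair(n_instances, size, overlap, seed=0):
    """Draw two index lists of equal length that share a fixed fraction of indices.

    Hypothetical helper: `overlap` is the fraction of each subsample that is
    common to both (an assumed definition of "level of overlap").
    """
    shared = round(overlap * size)   # indices present in both subsamples
    unique = size - shared           # indices private to each subsample
    needed = shared + 2 * unique     # distinct instances required overall
    if needed > n_instances:
        raise ValueError("dataset too small for this size and overlap")
    rng = random.Random(seed)
    pool = rng.sample(range(n_instances), needed)  # distinct indices
    common = pool[:shared]
    first = common + pool[shared:shared + unique]
    second = common + pool[shared + unique:]
    return first, second


def jaccard(features_a, features_b):
    """Jaccard index between two feature subsets (assumed similarity measure)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Example: 100 instances, subsamples of 40 instances, half of them shared.
first, second = overlapping_subsample_pair(100, 40, 0.5)
shared_count = len(set(first) & set(second))

# Compare two hypothetical feature subsets selected from each subsample.
similarity = jaccard(["loc", "cyclomatic", "fan_in"], ["cyclomatic", "fan_in", "fan_out"])
```

In this sketch a feature selection technique would be run on each subsample, and the similarity of the resulting subsets would be averaged over many pairs; comparing each subsample with the subset selected from the full dataset instead would correspond to the other proposed variant.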

Full Text