Abstract

For most software systems, some of the software metrics collected during the development cycle may contain redundant information, provide no information, or adversely affect prediction models built with them. Intelligently selecting software metrics (features) with feature selection techniques, which reduce the feature set to an optimal size, before building defect prediction models may improve the final defect prediction results. While some feature selection techniques consider each feature individually, feature subset selection evaluates entire feature subsets and can therefore help remove redundant features. Unfortunately, feature subset selection can select different features from similar datasets. This paper addresses the question of which feature subset selection methods are stable in the face of changes to the data (here, the addition or removal of instances). We examine twenty-seven feature subset selection methods: two filter-based techniques and twenty-five wrapper-based techniques (five choices of wrapper learner combined with five choices of wrapper performance metric). We use the Average Tanimoto Index (ATI) as our stability metric because it can compare two feature subsets of different sizes. All experiments were conducted on three software metric datasets from a real-world software project. Our results show that the Correlation-Based Feature Selection (CFS) approach has the greatest stability overall, and that all wrapper-based techniques are less stable than CFS. Among the twenty-five wrappers, those that pair the Naïve Bayes learner with either the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) or the Area Under the Precision-Recall Curve (PRC) performance metric are generally the most stable.
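
To make the stability metric concrete, the sketch below shows one common way to compute a Tanimoto-based stability score. It assumes the set form of the Tanimoto index, |A ∩ B| / |A ∪ B|, and assumes that the average is taken over all unordered pairs of selected subsets; the abstract does not spell out either detail, so the function names, the pairing scheme, and the metric names in the example are illustrative rather than the authors' exact procedure.

```python
from itertools import combinations


def tanimoto_index(a, b):
    """Set-based Tanimoto similarity: |A ∩ B| / |A ∪ B|.

    Works for subsets of different sizes, which is the property
    the abstract cites as the reason for choosing this metric.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty subsets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def average_tanimoto_index(subsets):
    """Mean Tanimoto index over all unordered pairs of subsets.

    Assumption: averaging over every pair; the paper may pair
    subsets differently (e.g. each perturbed run vs. the original).
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(tanimoto_index(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature subsets chosen by one selector on three
# perturbed versions of the same dataset (metric names are made up).
selected = [
    {"loc", "cc", "fanout"},
    {"loc", "cc"},
    {"loc", "churn"},
]
ati = average_tanimoto_index(selected)
print(f"ATI = {ati:.4f}")
```

A score of 1 means the selector returned the same subset on every perturbed dataset, and a score of 0 means no two runs shared a feature.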
