Software Defect Prediction Datasets Research Articles

Software defect prediction (SDP) is one of the most vital and cost-efficient activities for ensuring the quality of software under development. The performance of SDP relies heavily on the characteristics of the experimental datasets (SDP datasets). However, SDP datasets often exhibit class overlap, i.e., defective and non-defective modules have similar metric values. Class overlap hinders both the performance and the use of SDP models. Although efforts have been made to investigate the impact of overlapping-instance removal techniques on SDP performance, many issues remain open. For example: 1) How can overlapping instances be identified effectively? 2) Is class overlap universal in SDP datasets? 3) What are the impacts of class overlap on the performance and interpretation of SDP models? These questions are important but have not been fully explored. In this paper, we conduct an empirical study to comprehensively investigate the impact of class overlap on SDP. Specifically, we first propose an approach that identifies overlapping instances by analyzing the class distribution in the local neighborhood of a given instance. Based on this approach, we then investigate the impact of class overlap on the performance and interpretation of seven representative SDP models. Finally, we investigate the impact of two common techniques for handling overlapping instances (removing and separating) on the performance of SDP models. Through an extensive case study on 230 datasets spanning industrial and open-source software projects, we observe that: i) 70.0% of SDP datasets contain overlapping instances; ii) different levels of class overlap have different impacts on the performance of SDP models, and the class overlap ratio and the number of instances seriously affect the stability of that performance; iii) class overlap affects the ranking of important features in SDP models, particularly the features at the top 2 and top 3 ranks; iv) class overlap handling techniques statistically significantly improve the performance of SDP models trained on datasets with overlap ratios above 12.5%. Based on these findings, we suggest that future work in SDP apply our proposed KNN method to: i) check whether the overlap ratio of a defect dataset exceeds 12.5% before building SDP models; ii) remove the overlapping instances to identify more consistent guiding metrics; iii) combine the RF classifier with class overlap handling techniques when aiming to reduce code review effort.
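
The neighborhood-based identification described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact decision rule, so the rule used here is an assumption: an instance is flagged as overlapping when more than half of its k nearest neighbors (Euclidean distance) carry the opposite label. The function names, the `k` and `threshold` parameters, and the toy data are all hypothetical.

```python
import math


def find_overlapping(X, y, k=5, threshold=0.5):
    """Return indices of instances whose k nearest neighbours are mostly
    of the other class (assumed rule, not the paper's exact criterion)."""
    overlapping = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distance from instance i to every other instance, nearest first.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Share of neighbours whose label differs from instance i's label.
        other = sum(1 for j in neighbours if y[j] != yi)
        if other / len(neighbours) > threshold:
            overlapping.append(i)
    return overlapping


def overlap_ratio(X, y, k=5, threshold=0.5):
    """Fraction of instances flagged as overlapping."""
    return len(find_overlapping(X, y, k, threshold)) / len(X)


# Toy data: four clean modules (label 0), five defective modules (label 1),
# one of which sits in the middle of the clean cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

ratio = overlap_ratio(X, y, k=3)
print(find_overlapping(X, y, k=3), f"{ratio:.1%}")
# Compare against the 12.5% level reported in the abstract.
print("handling suggested" if ratio > 0.125 else "below 12.5%")
```

On real defect datasets the metrics would normally be scaled before distances are computed, and the brute-force neighbor search would be replaced by an indexed one; both are left out to keep the sketch short.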

The high dimensionality of software metric features has long been noted as a data quality problem that affects the performance of software defect prediction (SDP) models. This drawback makes it necessary to apply feature selection (FS) algorithms in SDP processes. FS approaches can be categorized into three types: filter FS (FFS), wrapper FS (WFS), and hybrid FS (HFS). HFS has been established as superior because it combines the strengths of both FFS and WFS methods. However, selecting the most appropriate FFS for HFS (the filter rank selection problem) is a challenge because the performance of FFS methods depends on the choice of datasets and classifiers. In addition, the HFS method inherits the local optima stagnation and high computational cost that WFS suffers from because of its large search spaces. As a solution, this study proposes a novel rank aggregation-based hybrid multifilter wrapper feature selection (RAHMFWFS) method for selecting relevant and non-redundant features from software defect datasets. The proposed RAHMFWFS is divided into two stepwise stages. The first stage is a rank aggregation-based multifilter feature selection (RMFFS) method that addresses the filter rank selection problem by aggregating individual rank lists from multiple filter methods, using a novel rank aggregation method to generate a single, robust, and non-disjoint rank list. In the second stage, the aggregated ranked features are further processed by an enhanced wrapper feature selection (EWFS) method based on a dynamic reranking strategy that guides the feature subset selection process of the HFS method. This, in turn, reduces the number of evaluation cycles while improving or maintaining prediction performance. The feasibility of the proposed RAHMFWFS was demonstrated on benchmark software defect datasets with Naïve Bayes and Decision Tree classifiers, based on accuracy, area under the curve (AUC), and F-measure values. The experimental results showed the effectiveness of RAHMFWFS in addressing the filter rank selection and local optima stagnation problems in HFS, as well as its ability to select optimal features from SDP datasets while maintaining or enhancing the performance of SDP models. To conclude, the proposed RAHMFWFS improved the prediction performance of SDP models across the selected datasets compared to existing state-of-the-art HFS methods.
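
The first stage described in this abstract merges the rank lists produced by several filter methods into one list. The abstract does not specify the novel aggregation rule, so the sketch below swaps in a plain Borda-style aggregation (sum of rank positions, lower is better) purely as an illustration. The function name, the metric names, and the three input rankings are hypothetical.

```python
def aggregate_ranks(rank_lists):
    """Merge several feature rankings (best feature first) into one list
    using summed rank positions. This is a Borda-style stand-in, not the
    aggregation method proposed in the paper."""
    features = set(rank_lists[0])
    for ranking in rank_lists:
        if set(ranking) != features or len(ranking) != len(features):
            raise ValueError("every ranking must order the same features once")

    totals = {f: 0 for f in features}
    for ranking in rank_lists:
        for position, feature in enumerate(ranking):
            totals[feature] += position

    # Lower total means consistently higher rank; ties break alphabetically.
    return sorted(features, key=lambda f: (totals[f], f))


# Hypothetical rankings of four software metrics from three filter methods.
filter_a = ["loc", "cc", "churn", "fan_in"]
filter_b = ["cc", "loc", "fan_in", "churn"]
filter_c = ["loc", "churn", "cc", "fan_in"]

print(aggregate_ranks([filter_a, filter_b, filter_c]))
# → ['loc', 'cc', 'churn', 'fan_in']
```

In a hybrid setup, a wrapper search would then walk this aggregated list rather than the full feature space, which is where the reduction in evaluation cycles mentioned in the abstract would come from.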

Articles published on Software Defect Prediction Datasets

The effect of data complexity on classifier performance

An efficient instance selection algorithm for fast training of support vector machine for cross-project software defect prediction pairs

Implementation of Chernobyl optimization algorithm based feature selection approach to predict software defects

HYBRID BINARY WHALE OPTIMIZATION ALGORITHM BASED ON TAPER SHAPED TRANSFER FUNCTION FOR SOFTWARE DEFECT PREDICTION

An Integrated Semi-supervised Software Defect Prediction Model

A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

Feature Selection Using Firefly Algorithm With Tree-Based Classification In Software Defect Prediction

A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction

Empirical Study: How Issue Classification Influences Software Defect Prediction

On the Value of Oversampling for Deep Learning in Software Defect Prediction

Hellinger Net: A Hybrid Imbalance Learning Model to Improve Software Defect Prediction

RFC: A feature selection algorithm for software defect prediction

A Novel Rank Aggregation-Based Hybrid Multifilter Wrapper Feature Selection Method in Software Defect Prediction

LIMCR: Less-Informative Majorities Cleaning Rule Based on Naïve Bayes for Imbalance Learning in Software Defect Prediction

Exploring High-Order Correlations for Industry Anomaly Detection

A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction

DP-Share: Privacy-Preserving Software Defect Prediction Model Sharing Through Differential Privacy

A novel modified undersampling (MUS) technique for software defect prediction

Performance Evaluation of Classification Algorithms Using MCDM and Rank Correlation Method Applied on Software Defect Prediction Datasets

An empirical study on pareto based multi-objective feature selection for software defect prediction
