A Paired Learner-Based Approach for Concept Drift Detection and Adaptation in Software Defect Prediction

Arvind Kumar Gangwar, Alok Mishra, Sandeep Kumar

doi:10.3390/app11146663

Abstract

The early and accurate prediction of defects helps in testing software and therefore leads to an overall higher-quality product. Due to drift in software defect data, the performance of prediction models may degrade over time. Very few earlier works have investigated the significance of concept drift (CD) in software-defect prediction (SDP). Their results have shown that CD is present in software defect data and that it has a significant impact on the performance of defect prediction. Motivated by this observation, this paper presents a paired learner-based drift detection and adaptation approach for SDP that dynamically adapts to varying concepts by updating one of the learners in the pair. For a given defect dataset, a subset of data modules is analyzed at a time by both learners based on their learning experience from the past. The difference in accuracy between the two learners is used to detect drift in the data. We evaluate the presented approach using defect datasets collected from the SEACraft and PROMISE data repositories. The experimental results show that the presented approach successfully detects the concept drift points and performs better than existing methods, as is evident from the comparative analysis performed using various performance parameters such as number of drift points, ROC-AUC score, accuracy, and statistical analysis using the Wilcoxon signed rank test.
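
The paired-learner idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the chunk size, the accuracy-gap threshold, the tiny 1-nearest-neighbor stand-in learner, and all function names are assumptions chosen for illustration. A stable learner is trained on all data seen since the last detected drift, a reactive learner is trained only on the most recent chunk, and a drift is flagged when the reactive learner outperforms the stable one by more than the threshold.

```python
class NearestNeighbor:
    """Tiny 1-NN classifier, a hypothetical stand-in for NB, DT, KNN, etc."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Index of the closest stored sample (ties go to the earliest one).
            best = min(
                range(len(self.X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, self.X[i])),
            )
            preds.append(self.y[best])
        return preds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_learner_drift(X, y, make_learner, chunk_size=20, threshold=0.2):
    """Return the start indices of chunks at which a drift is flagged.

    chunk_size and threshold are illustrative assumptions, not values
    taken from the paper.
    """
    # Both learners start from the first chunk.
    stable_X, stable_y = list(X[:chunk_size]), list(y[:chunk_size])
    reactive_X, reactive_y = list(stable_X), list(stable_y)
    stable = make_learner().fit(stable_X, stable_y)
    reactive = make_learner().fit(reactive_X, reactive_y)

    drift_points = []
    for start in range(chunk_size, len(X), chunk_size):
        chunk_X = list(X[start:start + chunk_size])
        chunk_y = list(y[start:start + chunk_size])

        # Test first: both learners predict the new chunk from past experience.
        acc_stable = accuracy(chunk_y, stable.predict(chunk_X))
        acc_reactive = accuracy(chunk_y, reactive.predict(chunk_X))

        if acc_reactive - acc_stable > threshold:
            # Drift: the stable learner forgets everything older than
            # the reactive learner's window.
            drift_points.append(start)
            stable_X, stable_y = list(reactive_X), list(reactive_y)

        # Then train: stable accumulates, reactive keeps only the newest chunk.
        stable_X += chunk_X
        stable_y += chunk_y
        reactive_X, reactive_y = chunk_X, chunk_y
        stable = make_learner().fit(stable_X, stable_y)
        reactive = make_learner().fit(reactive_X, reactive_y)

    return drift_points


# Synthetic stream: the labeling rule flips at index 60 (y = x, then y = 1 - x).
X = [[i % 2] for i in range(120)]
y = [x[0] if i < 60 else 1 - x[0] for i, x in enumerate(X)]
print(paired_learner_drift(X, y, NearestNeighbor))  # [80]
```

In this toy stream the drift is flagged one chunk after the true change point, because the reactive learner needs one chunk of the new concept before it can outperform the stable learner. In practice the 1-NN stand-in would be replaced by one of the evaluated learners.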

Highlights

To minimize software testing efforts by predicting defect-prone software modules beforehand, many software defect-prediction (SDP) approaches, as described in [1,2,3,4,5,6,7,8], have been presented so far
We evaluated a few available learners including decision tree (DT), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), and support vector machine (SVM)
This experiment first established the presence of concept drift (CD) in the studied software defect datasets using the proposed paired learner (PL) based method

Summary

Introduction

To minimize software testing efforts by predicting defect-prone software modules beforehand, many software defect-prediction (SDP) approaches, as described in [1,2,3,4,5,6,7,8], have been presented so far. These studies showed that SDP models analyze software metrics data to predict and fix bugs early in the software-development process, improving software testability and overall software quality [9]. The authors in [16,17] provided a formal definition of concept drift in terms of the statistical properties of a target variable over a period of time and the joint probabilities of the feature vector and target variable. These researchers identified three sources of CD in terms of the joint probabilities of a feature vector and a class variable. In terms of the statistical properties of the target variable, we can define CD as follows:
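
For orientation, a formulation commonly used in the concept drift literature is sketched below. The notation is an assumption and may differ from the definition given by the authors: $X$ denotes the feature vector, $y$ the class variable, and $p_t(X, y)$ their joint distribution at time $t$. Drift between two time points $t_0$ and $t_1$ means that the joint distribution has changed, and factoring the joint distribution shows where such a change can originate: the class prior $p(y)$, the class-conditional distribution $p(X \mid y)$, and, as a consequence, the posterior $p(y \mid X)$.

```latex
% Concept drift between time points t_0 and t_1 (assumed notation)
\exists X : \; p_{t_0}(X, y) \neq p_{t_1}(X, y)

% Factorization of the joint distribution, exposing the possible sources of change
p_t(X, y) = p_t(y)\, p_t(X \mid y) = p_t(X)\, p_t(y \mid X)
```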

Objectives

Results

Conclusion