CDS: A Cross-Version Software Defect Prediction Model With Data Selection

Jie Zhang, Zibin Zheng, Jiajing Wu, Chuan Chen, Michael R. Lyu

doi:10.1109/access.2020.3001440

Abstract

Over the past decade, a large number of software defect prediction approaches have been proposed to identify defect-prone modules by mining software repositories. Recently, a novel scenario called Cross-Version Defect Prediction (CVDP) has begun to draw increasing research interest, as it is more reasonable and applicable in practice to use the labeled defect data of previous versions to predict defects in the current version of the same project. Because a software project often has multiple previous versions, CVDP on such projects faces two critical but seldom reported issues: data distribution difference and class overlapping. In this paper, we address these two issues by solving a version selection problem via a Cross-version model with Data Selection (CDS). The proposed CDS is a novel framework that treats the defect prediction of existing and new files in different ways. For the existing files, we propose a novel Clustering-based Multi-Version Classifier (CMVC), which automatically selects the training data from the most relevant and noise-free versions by assigning them higher weights than the others. For the new files, which have never appeared in previous versions, we propose a Weighted Sampling Model (WSM) that incorporates the outputs of CMVC. We evaluate the proposed CDS model on 28 versions across 8 software projects, and the experimental results demonstrate that CDS outperforms three baseline methods and a state-of-the-art approach in terms of three prevalent performance indicators.

Highlights

Defects in a software system may cause improper behaviors and even lead to great financial loss and critical safety accidents.
Datasets collected from different versions of the same software project may contain instances with similar features but opposite labels, which causes class overlapping in the classification problem and degrades predictive performance. Motivated by two critical but rarely discussed issues for Cross-Version Defect Prediction (CVDP), namely data distribution difference and class overlapping, we propose a novel cross-version defect prediction model with data selection to address three research questions, the first of which is RQ1: For the same software project, is there a significant difference between the data distributions of various versions, and does this difference affect the performance of defect prediction models?
In this paper, we discuss the advantages of cross-version defect prediction (CVDP) for practical use compared with within-project defect prediction and cross-project defect prediction.
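
As a rough illustration of the two issues named above, the following minimal Python sketch measures (a) the distribution difference of one metric between two versions with a two-sample Kolmogorov-Smirnov statistic and (b) a simple class-overlap rate based on nearest neighbours with opposite labels. It is not the paper's procedure: the function names `ks_statistic` and `overlap_rate`, the `radius` threshold, and the toy feature values are all assumptions made for illustration.

```python
from bisect import bisect_right
from math import dist


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def overlap_rate(old, new, radius):
    """Fraction of instances in `new` whose nearest neighbour in `old`
    lies within `radius` (Euclidean) but carries the opposite label."""
    conflicts = 0
    for features, label in new:
        near_features, near_label = min(old, key=lambda inst: dist(inst[0], features))
        if dist(near_features, features) <= radius and near_label != label:
            conflicts += 1
    return conflicts / len(new)


# Hypothetical toy data: (feature vector, label), where label 1 = defective.
# The two features stand in for arbitrary code metrics.
v1 = [((10.0, 2.0), 0), ((50.0, 9.0), 1), ((12.0, 3.0), 0)]
v2 = [((10.5, 2.0), 1), ((48.0, 9.0), 1), ((30.0, 5.0), 0)]

# Distribution difference of the first metric between the two versions.
metric_shift = ks_statistic([f[0] for f, _ in v1], [f[0] for f, _ in v2])

# Share of v2 instances that look like a v1 instance but disagree on the label.
conflict_rate = overlap_rate(v1, v2, radius=1.0)
```

A larger KS statistic suggests that the two versions differ more on that metric, and a larger overlap rate suggests that using the older version as training data would inject more contradictory labels. In practice the features would need scaling before a distance threshold is meaningful.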

Summary

Introduction

Defects in a software system may cause improper behaviors and even lead to great financial loss and critical safety accidents. Techniques such as testing and code reviews are adopted to identify and correct defects in software systems. Defect prediction in software engineering is often formulated as a supervised binary classification problem.

A. CROSS-PROJECT DEFECT PREDICTION

The scenario of cross-project defect prediction (CPDP) has been proposed to address the problem of data insufficiency, from which within-project defect prediction (WPDP) often suffers, by utilizing training data from other projects. In [17], Wu et al. apply a unified semi-supervised approach to deal with the insufficiency of historical data in both cross-project and within-project scenarios.
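
To make the supervised binary classification formulation concrete, the sketch below trains on labeled files from a previous version and predicts labels for files in the current version. It uses a nearest-centroid classifier purely as a stand-in and is not the CMVC or WSM model proposed in the paper; the helper names `fit_centroids` and `predict` and the toy metric values are assumptions made for illustration.

```python
from math import dist


def fit_centroids(instances):
    """Return the mean feature vector of each class (0 = clean, 1 = defective)."""
    sums, counts = {}, {}
    for features, label in instances:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(centroids, features):
    """Assign the label whose class centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical toy data: (feature vector, label). The two features stand in
# for arbitrary code metrics of a file.
previous_version = [
    ((120.0, 4.0), 0),
    ((150.0, 5.0), 0),
    ((900.0, 30.0), 1),
    ((1100.0, 42.0), 1),
]
current_version = [
    ((130.0, 6.0), 0),
    ((1000.0, 35.0), 1),
    ((700.0, 20.0), 1),
]

# Train on the previous version, then predict on the current one.
model = fit_centroids(previous_version)
predictions = [predict(model, features) for features, _ in current_version]
accuracy = sum(p == y for p, (_, y) in zip(predictions, current_version)) / len(current_version)
```

Any binary classifier could take the place of the nearest-centroid rule. The point of the sketch is only the data flow: labels come from an earlier version, and predictions are made for the version under development.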

Methods

Results

Conclusion