Abstract
Software defect prediction (SDP) predicts the occurrence of defects in the early stages of the software development process. Early prediction of defects reduces the overall cost of software and increases its reliability. Most defect prediction methods proposed in the literature suffer from the class imbalance problem. In this paper, a novel class imbalance reduction (CIR) algorithm is proposed to create symmetry between the defect and non-defect records in imbalanced datasets by considering the distribution properties of the datasets. CIR is compared with SMOTE (synthetic minority oversampling technique), a built-in package of many machine learning tools that is considered a benchmark for handling class imbalance problems, and with K-Means SMOTE. We conducted experiments on forty open-source software defect datasets from the PRedictOr Models in Software Engineering (PROMISE) repository using eight different classifiers and evaluated the results with six performance measures. The results show that the proposed CIR method outperforms SMOTE and K-Means SMOTE.
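The CIR algorithm itself is the paper's contribution and is not reproduced here, but the SMOTE baseline it is compared against has a simple core idea: synthesize new minority-class (defect) records by interpolating between a minority sample and one of its nearest minority neighbours. A minimal sketch of that idea, using plain NumPy rather than any particular library implementation (`smote_oversample` and its parameters are illustrative names, not from the paper):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between a random minority sample and one of its k nearest
    minority neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment i -> j
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class: 6 defect records with 3 software metrics each
X_min = np.random.default_rng(0).random((6, 3))
X_new = smote_oversample(X_min, n_new=10, rng=1)
print(X_new.shape)  # (10, 3)
```

Because each synthetic record lies on a line segment between two existing minority records, SMOTE never extrapolates outside the minority region; K-Means SMOTE refines this by clustering first and oversampling within sparse clusters.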
Highlights
The most important activity in the testing phase of the software development process is software defect prediction (SDP) [1]
We propose a novel class imbalance reduction (CIR) technique that reduces the imbalance between defective and non-defective samples to achieve improved accuracy in software defect prediction; our technique is compared with the baseline synthetic minority oversampling technique (SMOTE) and its latest variant, K-Means SMOTE
K-nearest neighbor (KNN) outperforms the other classifiers in accuracy, precision, and specificity, whereas logistic regression performs well in recall, F-measure, and geometric mean
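The six performance measures named in the highlights can all be derived from a binary confusion matrix with the defective class as positive. A small sketch (the function name and the example confusion-matrix counts are illustrative, not taken from the paper's results):

```python
import math

def metrics(tp, fp, tn, fn):
    """Compute the six measures used in the study from a binary
    confusion matrix (defective = positive class)."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f_measure   = 2 * precision * recall / (precision + recall)
    # geometric mean balances performance on both classes,
    # which matters on imbalanced defect datasets
    g_mean      = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity,
            "f_measure": f_measure, "g_mean": g_mean}

# hypothetical confusion matrix for illustration only
print(metrics(tp=40, fp=10, tn=80, fn=20))
```

On imbalanced data, accuracy alone is misleading (predicting "non-defective" everywhere scores high), which is why the study also reports recall, F-measure, and geometric mean.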
Summary
The most important activity in the testing phase of the software development process is software defect prediction (SDP) [1]. SDP identifies defect-prone modules that need rigorous testing. By identifying defect-prone modules well in advance, testing engineers can use testing resources efficiently without violating constraints. Although SDP is most useful in the testing phase, it is not always easy to predict the defect-prone modules: various issues obstruct both the performance of the algorithms and the use of defect prediction methods. The AdaBoost algorithm works as follows: AdaBoost randomly selects a subset from the training data, then iteratively trains the chosen machine learning model, selecting each subsequent training subset based on the prediction accuracy of the previous iteration
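The iterative focus on previously mispredicted records described above can be sketched with the common reweighting formulation of AdaBoost (the summary describes a resampling variant; reweighting is the equivalent and more standard form). This is a minimal self-contained sketch with decision stumps as weak learners, not the paper's experimental setup:

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=10):
    """Minimal AdaBoost sketch with threshold stumps (labels in {-1, +1}).
    Each round re-weights the training set so that records the previous
    stump got wrong receive more attention in the next round."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # pick the stump (feature, threshold, sign) with least weighted error
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                for sign in (1, -1):
                    pred = np.where(X[:, f] <= t, sign, -sign)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, sign, pred)
        err, f, t, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)      # stump's vote weight
        w *= np.exp(-alpha * y * pred)             # up-weight mistakes
        w /= w.sum()
        stumps.append((f, t, sign))
        alphas.append(alpha)

    def predict(Xq):
        # weighted vote of all stumps
        score = sum(a * np.where(Xq[:, f] <= t, s, -s)
                    for a, (f, t, s) in zip(alphas, stumps))
        return np.sign(score)

    return predict

# toy 1-D data separable by a single threshold
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, 1, -1, -1, -1])
predict = adaboost_stumps(X, y, n_rounds=5)
print((predict(X) == y).all())  # True
```

The `w *= np.exp(-alpha * y * pred)` step is the mechanism the summary alludes to: correctly classified records are down-weighted and misclassified ones up-weighted, so each subsequent weak learner concentrates on the hard cases.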