Tackling Class Imbalance Problem in Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering

Lina Gong, Shujuan Jiang, Li Jiang

doi:10.1109/access.2019.2945858

Open Access

https://doi.org/10.1109/access.2019.2945858

Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 40
License type: CC BY 4.0

Affiliation: China University of Mining and Technology

Abstract

In practice, Software Defect Prediction (SDP) models often suffer from highly imbalanced data, which makes it difficult for classifiers to identify defective instances. Many techniques have recently been proposed to tackle this problem, and over-sampling is one of the best-known methods for addressing class imbalance. This technique balances the number of defective and non-defective instances by generating new defective instances. However, existing over-sampling approaches tend to generate non-diverse synthetic instances along with many unnecessary noise instances. Motivated by this, we propose a Cluster-based Over-sampling with noise Filtering (KMFOS) approach to tackle the class imbalance problem in SDP. KMFOS first divides the defective instances into $K$ clusters and generates new defective instances by interpolating between instances of each pair of clusters. As a result, the new defective instances spread diversely across the space of the defective dataset. We then extend this cluster-based over-sampling with Closest List Noise Identification (CLNI) to clean the noise instances. We conduct extensive experiments on 24 projects to compare KMFOS with over-sampling approaches such as SMOTE, Borderline-SMOTE, ADASYN, random over-sampling (ROS), K-means SMOTE, SMOTE + IPF, SMOTE + ENN and SMOTE + Tomek Links using five prediction classifiers. We also compare KMFOS with other state-of-the-art class-imbalance methods, including BalancedBaggingClassifier, RUSBoostClassifier, InstanceHardnessThreshold and cost-sensitive methods. Experimental results indicate that KMFOS obtains better Recall and bal values than the other over-sampling methods and the other compared class-imbalance methods. Hence, KMFOS is an efficient approach for generating balanced data for SDP and improves the performance of prediction models.
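The abstract names three steps: cluster the defective instances, interpolate between instances drawn from two different clusters, and then filter noise. The sketch below illustrates those steps under stated assumptions and is not the authors' implementation. The abstract does not give the clustering algorithm's settings, the interpolation weights, or the CLNI parameters, so the function names (`kmeans`, `cross_cluster_oversample`, `filter_noise`), the uniform interpolation weight, and the single-pass neighbor vote used as a stand-in for CLNI are all hypothetical.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on tuples; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def cross_cluster_oversample(defective, n_new, k=3, seed=0):
    """Create n_new synthetic points, each between two different clusters."""
    rng = random.Random(seed)
    labels = kmeans(defective, k, rng)
    clusters = [[p for p, lab in zip(defective, labels) if lab == c]
                for c in range(k)]
    clusters = [c for c in clusters if c]  # drop empty clusters
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    synthetic = []
    for _ in range(n_new):
        first, second = rng.sample(clusters, 2)  # two distinct clusters
        a, b = rng.choice(first), rng.choice(second)
        gap = rng.random()  # assumed: uniform interpolation weight in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic


def filter_noise(data, labels, n_neighbors=5, threshold=0.6):
    """Simplified stand-in for CLNI: drop an instance when at least
    `threshold` of its nearest neighbors carry a different label."""
    keep = []
    for i, (p, lab) in enumerate(zip(data, labels)):
        others = [j for j in range(len(data)) if j != i]
        nearest = sorted(others,
                         key=lambda j: math.dist(p, data[j]))[:n_neighbors]
        if not nearest:
            keep.append(i)
            continue
        disagree = sum(labels[j] != lab for j in nearest) / len(nearest)
        if disagree < threshold:
            keep.append(i)
    return [data[i] for i in keep], [labels[i] for i in keep]


# Usage on toy 2-D data (values are illustrative only).
defective = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0),
             (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
new_points = cross_cluster_oversample(defective, n_new=4, k=3)
```

Because each synthetic point is a convex combination of two existing defective instances from different clusters, it falls in the region between clusters instead of next to a single seed instance, which is one plausible reading of the "diversely spread" claim.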

Highlights

Software defect prediction (SDP) technologies can detect the largest number of defective modules by machine learning methods [1], [2], [30]
We propose a cluster-based over-sampling with filtering approach (KMFOS) that simultaneously improves the recognition rate of defective instances and reduces the misclassification rate of non-defective instances for classifiers
Experimental results indicate that our KMFOS significantly improves the Recall and bal in SDP
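
The page does not define the two reported measures. As a hedged illustration, the sketch below assumes that Recall is the probability of detection (pd) and that bal is the balance measure commonly used in the SDP literature: one minus the normalized Euclidean distance from the point (pf, pd) to the ideal point (0, 1), where pf is the false-alarm rate. The function name `recall_and_bal` is hypothetical.

```python
import math


def recall_and_bal(tp, fn, fp, tn):
    """Return (recall, bal) from confusion-matrix counts.

    Assumed definitions (not stated on this page):
      pd  = tp / (tp + fn)   recall on the defective class
      pf  = fp / (fp + tn)   false-alarm rate on the non-defective class
      bal = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2)
    """
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    bal = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, bal


# Example: 8 of 10 defective modules found, 10 of 100 clean modules flagged.
recall, bal = recall_and_bal(tp=8, fn=2, fp=10, tn=90)
```

Under this definition bal rewards a classifier only when high recall and a low false-alarm rate occur together, which matches the highlight's claim of improving both at once.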

Summary

Introduction

Software defect prediction (SDP) technologies can detect the largest number of defective modules by using machine learning methods [1], [2], [30]. These machine learning methods may achieve good prediction performance when the training datasets are balanced [3], [4], [29]. However, software projects contain more non-defective instances than defective instances, which leads to the class imbalance problem in SDP. The prevalent methods for tackling the class imbalance problem are mainly sampling, cost-sensitive and ensemble learning methods. Chawla et al. [9] proposed the synthetic minority over-sampling technique (SMOTE), using random over-sampling (ROS) as the core idea, whereby new artificial minority instances are generated to strike a balance between the number of minority and majority class instances.
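
To make the difference between the two sampling ideas concrete, the sketch below contrasts ROS, which copies existing minority instances, with a SMOTE-style step, which places a new instance on the line segment between a minority instance and one of its nearest minority neighbors. It is a simplified illustration and not a reference implementation of [9]; the function names and the default `k=3` are assumptions.

```python
import math
import random


def random_oversample(minority, n_new, seed=0):
    """ROS: duplicate randomly chosen minority instances."""
    rng = random.Random(seed)
    return [rng.choice(minority) for _ in range(n_new)]


def smote_sketch(minority, n_new, k=3, seed=0):
    """SMOTE-style sampling: interpolate toward a nearby minority neighbor."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the base instance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(x + gap * (y - x)
                               for x, y in zip(base, neighbor)))
    return synthetic


# Usage on toy 2-D data (values are illustrative only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
copies = random_oversample(minority, n_new=3)
interpolated = smote_sketch(minority, n_new=3)
```

ROS adds no new feature values, while the interpolated points stay close to their seed instance and its neighbors; this locality is the lack of diversity that the abstract attributes to existing over-sampling methods.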

Similar Papers

Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study
Sushant Kumar Pandey ... Anil Kumar Tripathi
01 Jul 2021

Is Open-Source Software Valuable for Software Defect Prediction of Proprietary Software and Vice Versa?
Misha Kakkar ... Sarika Jain
25 Nov 2017

Optimization of software defects prediction in imbalanced class using a combination of resampling methods with support vector machine and logistic regression
Catur Iswahyudi ... Windyaning Ustyannie
JURNAL INFOTEL | VOL. 13
09 Dec 2021

Supp1-3131950.pdf
Gopi Krishnan Rajbahadur
02 Dec 2021
