An empirical study of the classification performance of learners on imbalanced and noisy software quality data

Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, Andres Folleco

doi:10.1016/j.ins.2010.12.016

Abstract

Data mining techniques are commonly used to construct models for identifying the software modules that are most likely to contain faults. Such models allow an organization to allocate its limited resources intelligently, with the goal of detecting and correcting the greatest number of faults. However, two characteristics of software quality datasets can negatively impact the effectiveness of these models: class imbalance and class noise. Software quality datasets are, by their nature, imbalanced. That is, most of a software system’s faults can be found in a small percentage of its software modules. Therefore, the number of fault-prone (fp) examples (program modules) in a software project dataset is much smaller than the number of not fault-prone (nfp) examples. Data sampling techniques attempt to alleviate the problem of class imbalance by altering a training dataset’s class distribution. A program module contains class noise if it is incorrectly labeled. While several studies have evaluated data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) as well as 7 different data sampling techniques. We identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
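The two data characteristics described in the abstract can be illustrated with a minimal sketch. The abstract does not name the seven sampling techniques studied, so random undersampling is used here purely as an assumed, commonly known example of altering a training set's class distribution. The function names, the toy dataset, and the 10% noise rate are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def inject_class_noise(labels, noise_rate, rng):
    """Flip a fraction of the labels between 'fp' and 'nfp'.

    This simulates class noise, i.e. modules that are incorrectly labeled.
    (Hypothetical helper, not the paper's noise-injection procedure.)
    """
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    flipped = {"fp": "nfp", "nfp": "fp"}
    return [flipped[y] if i in flip_idx else y for i, y in enumerate(labels)]


def random_undersample(examples, labels, rng):
    """Randomly discard examples until every class has the minority-class size.

    (Assumed illustration of a data sampling technique; the paper's seven
    techniques are not listed in the abstract.)
    """
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    sampled = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n_min)]
    rng.shuffle(sampled)
    return [x for x, _ in sampled], [y for _, y in sampled]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy, made-up dataset: 100 module ids, 10 fault-prone and 90 not fault-prone.
    modules = list(range(100))
    clean_labels = ["fp"] * 10 + ["nfp"] * 90

    noisy_labels = inject_class_noise(clean_labels, noise_rate=0.10, rng=rng)
    print("before sampling:", Counter(noisy_labels))

    _, balanced_labels = random_undersample(modules, noisy_labels, rng)
    print("after sampling: ", Counter(balanced_labels))
```

Note that sampling operates on the observed (possibly noisy) labels, so mislabeled modules can end up in the balanced training set. That interaction between noise and sampling is the kind of effect the study sets out to measure.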

Journal: Information Sciences
Publication Date: Jan 9, 2011
Citations: 122

Similar Papers

An Empirical Study of the Classification Performance of Learners on Imbalanced and Noisy Software Quality Data
Chris Seiffert ... Andres Folleco
01 Aug 2007

Supervised Neural Network Modeling: An Empirical Investigation Into Learning From Imbalanced Data With Labeling Errors
Taghi M Khoshgoftaar ... Amri Napolitano
IEEE Transactions on Neural Networks | Vol. 21
15 Mar 2010

Knowledge discovery from imbalanced and noisy data
Jason Van Hulse ... Taghi Khoshgoftaar
Data & Knowledge Engineering | Vol. 68
23 Aug 2009

How to Optimally Combine Univariate and Multivariate Feature Selection with Data Sampling for Classifying Noisy, High Dimensional and Class Imbalanced DNA Microarray Data
Ahmad Abu Shanab ... Taghi M Khoshgoftaar
24 Mar 2020
