Abstract

Most empirical disciplines promote the reuse and sharing of datasets, since shared data make replication more feasible. While this is increasingly the case in Empirical Software Engineering (ESE), some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to under-performing prediction models, and second, that the external validity of studies based on biased datasets may be suspect. The issue has caused considerable consternation in the ESE literature in recent years. However, a confounding factor in these datasets has not been examined carefully: size. Biased datasets sample only some of the data that could be sampled, and do so in a biased fashion; but biased samples can be smaller or larger. Smaller datasets generally provide a less reliable basis for estimating models, and thus could lead to inferior model performance. In this setting, we ask: what affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset that is relatively free of bias. Our results suggest that size always matters at least as much as bias direction, and in fact much more than bias direction when considering information-retrieval measures such as AUCROC and F-score. This indicates that, at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue and raises further questions to be explored in the future.
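The experimental design described above can be sketched in a few lines of code. The Python below is not the authors' actual pipeline: the dataset (synthetic), the bias mechanism (over-sampling by one feature), and the model (plain logistic regression) are all illustrative stand-ins. It only shows how training-set size and sampling bias can be varied independently while AUCROC and F-score are measured on a fixed held-out test set.

    # Minimal sketch of the size-vs-bias comparison: draw training sets of
    # varying size, with and without a simulated sampling bias, then compare
    # prediction performance (AUCROC and F-score) on one held-out test set.
    # Dataset, bias mechanism, and model are illustrative stand-ins only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Stand-in for a (relatively) bias-free bug-fix dataset.
    X, y = make_classification(n_samples=20_000, n_features=10,
                               weights=[0.8, 0.2], random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    def biased_sample(X, y, n, bias_strength=0.0):
        """Draw n training examples; bias_strength > 0 over-samples points
        with a high value of the first feature, mimicking a collection
        process that favours certain kinds of bug fixes."""
        scores = X[:, 0] * bias_strength
        p = np.exp(scores - scores.max())
        p /= p.sum()
        idx = rng.choice(len(y), size=n, replace=False, p=p)
        return X[idx], y[idx]

    for n in (200, 1000, 5000):
        for bias in (0.0, 2.0):  # unbiased vs. strongly biased sampling
            Xtr, ytr = biased_sample(X_pool, y_pool, n, bias)
            model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
            proba = model.predict_proba(X_test)[:, 1]
            print(f"n={n:5d} bias={bias:.1f} "
                  f"AUCROC={roc_auc_score(y_test, proba):.3f} "
                  f"F={f1_score(y_test, proba > 0.5):.3f}")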

Highlights

  • Detailed data on bugs are clearly crucial to empirical studies of software quality

  • Since we introduce bias into our training sets, which may disturb the relationship between the training-set and test-set distributions, ridge regression provides reasonable assurance that multicollinearity is not further degrading the quality of our prediction models on top of the bias introduced by our experimental setup (a small illustration follows this list)

  • This has led to widespread concerns, reported in several papers, that biased datasets would lead to under-performing, even misleading, prediction models of limited practical value
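The role of ridge regression mentioned in the second highlight can be shown with a small sketch. The data below is synthetic and the penalty weight (alpha=1.0) is arbitrary; the point is only that with nearly collinear predictors, ordinary least squares produces large, offsetting coefficients, while the L2 penalty shrinks them toward a stable shared effect.

    # Why ridge regression helps under multicollinearity: with strongly
    # correlated predictors, OLS coefficients are unstable, while the
    # L2 penalty keeps them bounded. Synthetic, purely illustrative data.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(scale=0.5, size=n)     # true effect lives on x1

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)
    print("OLS coefficients:  ", ols.coef_)    # large, offsetting values
    print("Ridge coefficients:", ridge.coef_)  # shrunk toward shared effect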


Summary

Introduction

Detailed data on bugs are clearly crucial to empirical studies of software quality. Such data are generally collected in bug-fix datasets, where the fix location is provided by a link to commits in the version control system. These links identify the source code files involved in a bug report, as well as other details, such as the developer who committed the fix, the date and time, and the lines changed in the corresponding files. This is a rich source of historical data for building software quality prediction models that may yield improved understanding of the factors affecting software quality. A hypothetical record illustrating such a link is sketched below.
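The following Python record type illustrates the kind of information one bug-fix link typically carries; the type and field names are invented for this sketch and do not come from any specific dataset.

    # Hypothetical record for a single bug-fix link; field names are
    # illustrative, not taken from any particular bug-fix dataset.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class BugFixLink:
        bug_id: str             # bug report identifier in the issue tracker
        commit_hash: str        # fixing commit in the version control system
        files: list[str]        # source code files changed by the fix
        author: str             # developer who committed the fix
        committed_at: datetime  # date and time of the fix
        lines_changed: dict[str, int]  # lines touched per file

    fix = BugFixLink(
        bug_id="BUG-1234",
        commit_hash="a1b2c3d",
        files=["src/parser.c"],
        author="alice",
        committed_at=datetime(2013, 5, 1, 14, 30),
        lines_changed={"src/parser.c": 12},
    )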


