Abstract

Software effort estimation (SEE) usually suffers from a data scarcity problem due to the expensive and lengthy process of data collection. As a result, companies usually have few completed projects available for effort estimation, causing unsatisfactory prediction performance. Few studies have investigated strategies to generate additional SEE data to aid such learning. We propose a synthetic data generator to address the data scarcity problem of SEE. Our generator enlarges the SEE data set by slightly displacing some randomly chosen training examples, and can be used with any SEE method as a data preprocessor. Its effectiveness is evaluated with 6 state-of-the-art SEE models across 14 SEE data sets. We also compare our data generator against the only existing approach in the SEE literature. Experimental results show that our synthetic projects can significantly improve the performance of some SEE methods, especially when the training data are insufficient. When they cannot significantly improve prediction performance, they are not detrimental either. Moreover, our synthetic data generator is significantly superior, or performs similarly, to its competitor in the SEE literature. Our data generator thus has a non-harmful, if not significantly beneficial, effect on the SEE methods investigated in this paper, and is therefore helpful in addressing the data scarcity problem of SEE.

Highlights

  • Software effort estimation (SEE) is the process of predicting the effort required to develop a software system

  • We investigate 6 SEE models: linear regression (LR), automatically transformed linear model (ATLM), k-nearest neighbour (k-NN), relevance vector machine (RVM), regression tree (RT) and support vector regression (SVR), since they are among the state-of-the-art SEE predictors [13, 39, 42, 49, 61, 71]

  • For small training set sizes, we can see from Table 4(a) that the synthetic projects generated by our approach can drastically improve the performance of LR/ATLM with large effect size in five out of seven SEACRAFT data sets


Summary

INTRODUCTION

Software effort estimation (SEE) is the process of predicting the effort (e.g. in person-months or person-hours) required to develop a software system. Rather than introducing sophisticated SEE models or collecting as many completed projects as possible, we can augment an SEE data set by generating synthetic projects based on the existing data. We investigated the following research question: RQ1 Given an SEE predictor, can our synthetic data generator help improve prediction performance over a baseline that does not use synthetic data? Our synthetic projects generally have a positive effect on, and are rarely detrimental to, the baseline performance of the investigated SEE models, especially when the training data are insufficient. The main contribution of this paper is to propose and validate a novel synthetic data generator, and to provide an understanding of when and why the synthetic projects generated by this approach can help improve the baseline performance of an SEE model
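The core idea, enlarging the training set by slightly displacing randomly chosen projects before fitting any SEE model, can be sketched as follows. This is a minimal illustration only: the function name, the `displacement` parameter, and the uniform perturbation scheme are assumptions for the sketch, not the paper's actual generator.

```python
import random

def generate_synthetic_projects(projects, n_synthetic, displacement=0.1, rng=None):
    """Enlarge an SEE training set by slightly displacing randomly
    chosen projects.

    Each project is a (features, effort) pair, where features is a list
    of numeric values. `displacement` bounds the relative size of the
    random perturbation (a hypothetical parameter; the paper's exact
    displacement scheme may differ).
    """
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_synthetic):
        features, effort = rng.choice(projects)
        # Displace each numeric feature by a small uniform fraction.
        new_features = [x * (1 + rng.uniform(-displacement, displacement))
                        for x in features]
        # Displace the effort value as well, keeping it near the original.
        new_effort = effort * (1 + rng.uniform(-displacement, displacement))
        synthetic.append((new_features, new_effort))
    return synthetic

# Usage: augment a toy training set, then train any SEE model on it.
train = [([10.0, 3.0], 120.0), ([25.0, 5.0], 300.0)]
augmented = train + generate_synthetic_projects(train, n_synthetic=4,
                                                rng=random.Random(42))
print(len(augmented))  # 6
```

Because the generator only touches the training data, it acts as a preprocessor: the downstream predictor (LR, k-NN, RT, etc.) is used unchanged on the augmented set.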

Data Augmentation for Classification
Data Augmentation in SEE Literature
OUR SYNTHETIC DATA GENERATOR
Synthetic Feature Generation
Synthetic Effort Generation
Further Discussions and Summary
Data Sets
Performance Evaluation
Baseline SEE Predictors Investigated
RESULTS AND DISCUSSION
Effect of Synthetic Data on Performance
Comparison of Synthetic Generators
THREATS TO VALIDITY
Findings
CONCLUSIONS