Abstract

CONTEXT: Three previous studies have investigated the use of a chronological split to compare cross-company with single-company effort predictions, all using release 10 of the ISBSG dataset. These studies therefore need to be replicated using different datasets, so that the patterns previously observed can be compared and contrasted and a better understanding of the use of chronological splitting can be reached. OBJECTIVE: The aim of this study is to replicate [17] using the same chronological splitting, but a different dataset - the Finnish dataset. METHOD: Chronological splitting was compared with two forms of cross-validation. The chronological splitting used was the project-by-project chronological split, in which a validation set contains a single project and a regression model is built from scratch using, as the training set, the projects completed before the validation project's start date. We used 201 single-company projects and 593 cross-company projects from the Finnish dataset. RESULTS: Single-company models presented significantly better prediction accuracy than cross-company models. Chronological splitting provided significantly worse accuracy than leave-one-out and leave-two-out cross-validation when based on single-company data, and similar accuracy when based on cross-company data. CONCLUSIONS: Results did not seem promising for project-by-project splitting; however, in a real scenario, companies that use their own data can only apply some form of chronological splitting when obtaining effort estimates for their new projects. We therefore urge the use of chronological splitting in effort estimation studies so that more realistic results can be provided to inform industry.

Highlights

  • Numerous software companies find it difficult to accumulate data on their past, finished projects, and yet, to remain competitive, they must provide accurate project effort estimates

  • With project-by-project chronological splitting, we only built regression models when there were at least 12 training projects that had been completed prior to the start of the single-company project for which effort was to be estimated (a minimal sketch of this procedure follows these highlights)

  • The cross-company models used as part of the project-by-project chronological splitting procedure will not be described individually, as there were 199 different models, automatically fitted using the statistical language R
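
The Python sketch below is a minimal illustration of the project-by-project chronological splitting procedure, not the authors' implementation (the study fitted its models in R). The Project fields, the single size predictor, and the log-log regression form are assumptions made purely for illustration; the only details taken from the text are that each validation set holds a single project, that the training set contains only projects completed before that project's start date, and that a model is built only when at least 12 such projects exist.

from dataclasses import dataclass
from datetime import date
import numpy as np

@dataclass
class Project:
    name: str
    start: date     # start date of the project
    end: date       # completion date of the project
    size: float     # size measure, e.g. function points (assumed predictor)
    effort: float   # actual effort, e.g. person-hours

def chronological_estimates(projects, min_training=12):
    """Project-by-project chronological split: estimate each project's effort
    using a regression model built only from projects completed before the
    validation project's start date."""
    estimates = {}
    for p in projects:
        training = [t for t in projects if t.end < p.start]
        if len(training) < min_training:
            continue  # too few completed projects to build a model
        # Illustrative model only: ordinary least squares on log(effort) ~ log(size).
        x = np.log([t.size for t in training])
        y = np.log([t.effort for t in training])
        slope, intercept = np.polyfit(x, y, 1)
        estimates[p.name] = float(np.exp(intercept + slope * np.log(p.size)))
    return estimates

Accuracy measures would then be computed by comparing each estimate in the returned dictionary with the corresponding project's actual effort, one prediction per validation project.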



Introduction

Numerous software companies find it difficult to accumulate data on their past, finished projects, and yet, to remain competitive, they must provide accurate project effort estimates. The normal approach in these studies – and in almost all work in software engineering that builds effort estimation models from historical data – involves separating the data into a training set (from which the model is built) and a validation set (used to assess the model's accuracy). The assignment of projects to training and validation sets is done without regard to the completion date of the projects. This makes it very likely (in leave-one-out cross-validation it is certain) that the data used to build a model to estimate effort for a given project p include projects that were completed after p was finished.
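
As a concrete illustration of that point (reusing the hypothetical Project records from the earlier sketch, not code from the paper), the helper below counts how many projects in a leave-one-out training set were completed after the validation project was finished; under a chronological split this count is zero by construction.

def completed_after_validation(projects, validation_project):
    """In leave-one-out cross-validation the training set is simply every other
    project, so it can contain projects finished after the validation project -
    information that would not have existed when the estimate was needed."""
    training = [t for t in projects if t is not validation_project]
    return sum(1 for t in training if t.end > validation_project.end)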

