Improved Analogy-based Effort Estimation with Incomplete Mixed Data

Ibtissam Abnane, Ali Idri

doi:10.15439/2018f95

Abstract

Estimation by analogy (EBA) is one of the most attractive software development effort estimation techniques. However, one of the critical issues when using EBA is the occurrence of missing data (MD) in the historical data sets. The absence of values for several relevant software attributes is a frequent phenomenon that may cause inaccurate EBA estimates. The MD can be numerical and/or categorical. This paper evaluates four MD techniques (toleration, deletion, k-nearest neighbors (KNN) imputation and support vector regression (SVR) imputation) over four mixed data sets. A total of 432 experiments were conducted, involving four MD techniques, nine MD percentages (from 10% to 90%), three missingness mechanisms (MCAR: Missing Completely at Random, MAR: Missing at Random and NIM: Non-Ignorable Missing) and four data sets. The evaluation process consists of four steps and uses several accuracy measures such as standardized accuracy (SA) and prediction level (Pred). The results suggest that EBA with imputation techniques achieved significantly better SA values than EBA with toleration or deletion, regardless of the missingness mechanism. Moreover, no particular MD imputation technique outperformed the others overall. However, according to Pred and other accuracy criteria, EBA with SVR imputation was the best, followed by EBA with KNN imputation; we also found that using toleration instead of deletion improves the accuracy of EBA.
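The two accuracy measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly used definitions: Pred(l) is the fraction of projects whose magnitude of relative error is at most l, and SA is one minus the ratio of the model's mean absolute error (MAE) to the MAE of random guessing, where a random guess uses the actual effort of another project. The function names and the exact pairwise form of the random-guessing baseline are illustrative assumptions, not the authors' code.

```python
from itertools import permutations


def pred(actual, estimated, level=0.25):
    """Fraction of projects whose magnitude of relative error is <= level."""
    hits = sum(abs(a - e) / a <= level for a, e in zip(actual, estimated))
    return hits / len(actual)


def mae(actual, estimated):
    """Mean absolute error between actual and estimated efforts."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def random_guess_mae(actual):
    """Expected MAE when each estimate is the actual effort of a different,
    randomly chosen project (assumed baseline, computed over all ordered pairs)."""
    pairs = list(permutations(actual, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def standardized_accuracy(actual, estimated):
    """SA as a ratio in [.., 1]; multiply by 100 for a percentage."""
    return 1.0 - mae(actual, estimated) / random_guess_mae(actual)
```

With this definition, an SA close to zero means the estimator does no better than random guessing, which is why SA is useful for comparing MD techniques across data sets of different scales.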

Highlights

Software development effort estimation (SDEE) is the process of predicting the effort required to develop a software system.
(RQ1) Is there evidence that the use of KNN and support vector regression (SVR) imputations rather than toleration/deletion improves the accuracy of Estimation by Analogy (EBA) in terms of standardized accuracy (SA) when using mixed datasets?
This study evaluated EBA using four missing data (MD) techniques (toleration, deletion, KNN imputation, and SVR imputation) with different MD percentages and three missingness mechanisms (MCAR, Missing at Random (MAR) and Non-Ignorable Missing (NIM)) on four datasets with mixed data (ISBSG R8, COCOMO81, USP05_FT and USP05_RQ). Four research questions (RQs 1-4) were discussed.

Summary

Introduction

Software development effort estimation (SDEE) is the process of predicting the effort required to develop a software system. It is a challenging and substantial activity when managing a software project. Machine learning (ML) based estimation techniques are gaining increasing attention in SDEE research, as they can model the complex relationship between effort and software attributes (cost drivers), especially when this relationship is not linear and does not seem to have any predetermined form [2]. Estimation by Analogy (EBA) is one of the most attractive ML techniques in the SDEE field, and is essentially a form of Case-Based Reasoning (CBR) [3]. Toleration is not a reliable approach, sometimes even providing estimates that are less efficient than those obtained with the deletion technique [8], [17].
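To make the idea of analogy-based estimation with toleration concrete, the following is a minimal Python sketch under stated assumptions. It uses a Gower-style distance over mixed attributes (range-normalized difference for numerical attributes, simple mismatch for categorical ones), skips any attribute that is missing in either project, and estimates effort as the mean effort of the k closest historical projects. The distance, the attribute names, the `ranges` structure and the value of k are all hypothetical choices for illustration; the text here does not specify the similarity measure or adaptation rule the authors used.

```python
import math


def distance(p, q, ranges):
    """Gower-style distance over mixed attributes.

    `ranges` maps each attribute name to its numeric range, or to None
    when the attribute is categorical. Attributes that are missing (None)
    in either project are skipped, which is the toleration strategy.
    """
    total, used = 0.0, 0
    for key, rng in ranges.items():
        a, b = p.get(key), q.get(key)
        if a is None or b is None:
            continue  # toleration: ignore the missing attribute
        if rng is None:
            total += 0.0 if a == b else 1.0  # categorical mismatch
        else:
            total += abs(a - b) / rng  # normalized numerical difference
        used += 1
    # With no comparable attribute the projects are treated as infinitely far apart.
    return total / used if used else math.inf


def estimate_effort(target, history, ranges, k=2):
    """Mean effort of the k historical projects closest to the target."""
    ranked = sorted(history, key=lambda h: distance(target, h, ranges))
    nearest = ranked[:k]
    return sum(h["effort"] for h in nearest) / len(nearest)
```

Deletion would instead drop every historical project containing a missing value before ranking, and imputation would fill the `None` entries (for example with KNN or SVR predictions) so that no attribute has to be skipped.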

Objectives

Results

Conclusion