Abstract

Shull et al. raise many issues that I would whole-heartedly endorse. For example, replication is a basic component of the scientific method, so it hardly needs to be justified. Furthermore, a full, comprehensive report of any empirical study is a requirement of good science. When restrictions on the length of conference or journal papers prevent such reporting, it is an extremely good idea to record all the experimental details somewhere, whether in a “laboratory package” or a technical report. However, is such supporting information essential for subsequent replications? The authors give examples, such as insufficient training or systems not having defects appropriate for the technique, of conditions that caused replications to wrongly contradict the original experiment. I believe issues such as the criticality of training or the type of relevant defect should be reported in the published study, not hidden in ancillary data. It may be preferable to advise empirical researchers to make very clear, in their journal and conference papers, the critical components of the software engineering methods they are studying.

I also disagree with some of the other arguments in favour of close replications. When arguing against independent replications, the authors suggest that such a replication may contradict the original experiment without the researchers knowing why. They conclude that “conducting a replication that has the risk of producing no useful results is too risky to be the norm”, and that it is therefore cheaper and more feasible to run replications that reuse aspects of existing experiments. However, it is equally risky to undertake a long series of experiments with relatively small changes if those experiments propagate a problem in the original experiment. The authors note this problem but do not seem willing to assess its implications.

Kampenes (2007) reported that some 30% of the 113 experiments reviewed by Sjoberg et al. (2005) were quasi-experiments, and six of those used one of the weakest forms of quasi-experiment: a within-subject study without any cross-over (i.e. all subjects used treatment A first and treatment B second). This design completely confounds treatment and order. Furthermore, any learning effects that occur (particularly an issue when the subjects are students) are also confounded with treatment. Several of these studies

Empir Software Eng (2008) 13:219–221 DOI 10.1007/s10664-008-9061-0
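To illustrate the confounding described above, here is a minimal simulation sketch (hypothetical, not from the paper or its data): two techniques A and B with no true difference, but every subject improves by a fixed practice gain on whatever task they perform second. The design with no cross-over attributes the entire learning effect to treatment B, whereas a counterbalanced cross-over cancels it out. All names and numbers are illustrative assumptions.

```python
import random

random.seed(0)

# Illustrative assumptions: B is truly no better than A, but every
# subject gains a fixed "learning" bonus on their second task.
TRUE_B_EFFECT = 0.0   # B's real advantage over A
LEARNING = 5.0        # practice gain on whichever task comes second

def run_subject(first, second):
    """Return scores for one subject performing two tasks in order."""
    base = random.gauss(50, 2)
    scores = {first: base, second: base + LEARNING}
    scores["B"] += TRUE_B_EFFECT
    return scores

N = 1000

# Design 1: no cross-over -- every subject does A first, then B.
no_crossover = [run_subject("A", "B") for _ in range(N)]
diff_confounded = sum(s["B"] - s["A"] for s in no_crossover) / N

# Design 2: cross-over -- half the subjects get the reverse order,
# so the learning effect cancels across the two groups.
crossover = [run_subject("A", "B") if i % 2 == 0 else run_subject("B", "A")
             for i in range(N)]
diff_balanced = sum(s["B"] - s["A"] for s in crossover) / N

print(f"A-then-B only:   apparent B advantage = {diff_confounded:.1f}")
print(f"counterbalanced: apparent B advantage = {diff_balanced:.1f}")
```

In the no-cross-over design the apparent advantage of B equals the learning effect, even though B's true effect is zero; the counterbalanced design recovers an advantage near zero. This is the sense in which treatment, order, and learning are "completely confounded".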
