Shotgun correlations in software measures

Richard E Courtney, David A Gustafson

doi:10.1049/sej.1993.0002

Abstract

Many software measures have been put forward on the simple basis of a high linear correlation coefficient with some measurable quantity. The linear correlation coefficient is an unreliable statistic for deciding whether an observed correlation indicates a significant association. Several published software measure experiments collected more than 20 different measurements or had 14 or fewer observations. When many measurements are taken from small samples, the probability of ‘discovering’ a ‘significant’ correlation is high. We present a computer simulation experiment in which the correlation between sets of randomly generated numbers is calculated. We also look at randomly generated numbers in the ranges that would be expected in Halstead's Software Science [1] measures. Our results show that the average maximum linear correlation for randomly generated numbers is 0.70 or higher if the sample size is small compared to the number of variables. Alternative statistical approaches to obtaining meaningful, significant results are presented.
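The kind of simulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' program: the function names, the uniform distribution on [0, 1), the number of trials, and the use of the largest absolute pairwise correlation as the "maximum" are all assumptions made for illustration.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Sample linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return sxy / math.sqrt(sxx * syy)


def average_max_abs_correlation(n_vars, n_obs, trials=200, seed=0):
    """Average, over `trials` random data sets, of the largest |r| among all
    pairs of `n_vars` unrelated variables with `n_obs` observations each.

    Assumption: each variable is drawn independently and uniformly from
    [0, 1), so any correlation that shows up is due to chance alone.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = [[rng.random() for _ in range(n_obs)] for _ in range(n_vars)]
        total += max(
            abs(pearson_r(a, b)) for a, b in itertools.combinations(data, 2)
        )
    return total / trials


if __name__ == "__main__":
    # Hypothetical settings: 20 measurements, few versus many observations.
    for n_obs in (10, 14, 50):
        avg = average_max_abs_correlation(n_vars=20, n_obs=n_obs, trials=100)
        print(f"n_vars=20 n_obs={n_obs}: average max |r| = {avg:.2f}")
```

With 20 variables there are 190 variable pairs to compare, so a small number of observations leaves many chances for one pair to correlate strongly by accident. Increasing `n_obs` while holding `n_vars` fixed should lower the average maximum.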
