Abstract

In spite of several decades of software metrics research and practice, there is little understanding of how software metrics relate to one another, nor is there any established methodology for comparing them. We propose a novel experimental technique, based on search-based refactoring, to ‘animate’ metrics and observe their behaviour in a practical setting. Our aim is to promote metrics to the level of active, opinionated objects that can be compared experimentally to uncover where they conflict, and to understand better the underlying cause of the conflict. Our experimental approaches include semi-random refactoring, refactoring for increased metric agreement/disagreement, refactoring to increase/decrease the gap between a pair of metrics, and targeted hypothesis testing. We apply our approach to five popular cohesion metrics using ten real-world Java systems, involving 330,000 lines of code and the application of over 78,000 refactorings. Our results demonstrate that cohesion metrics disagree with each other in a remarkable 55 % of cases, that Low-level Similarity-based Class Cohesion (LSCC) is the best representative of the set of metrics we investigate while Sensitive Class Cohesion (SCOM) is the least representative, and we discover several hitherto unknown differences between the examined metrics. We also use our approach to investigate the impact of including inheritance in a cohesion metric definition and find that doing so dramatically changes the metric.

Highlights

  • Like all engineers, software engineers want to measure the materials they engineer, leading to many proposals for ways to measure software using so-called ‘software metrics’ (Shepperd 1995)

  • Our results demonstrate that cohesion metrics disagree with each other in a remarkable 55 % of cases, that Low-level Similarity-based Class Cohesion (LSCC) is the best representative of the set of metrics we investigate while Sensitive Class Cohesion (SCOM) is the least representative, and we discover several hitherto unknown differences between the examined metrics

  • A monolithic system implemented as a class with a thousand methods would no doubt permit many refactorings that improve the employed cohesion metrics, but such enormous and poorly-designed classes do not occur in practice


Summary

Introduction

Software engineers want to measure the materials they engineer, leading to many proposals for ways to measure software using so-called ‘software metrics’ (Shepperd 1995). Previous work on metric validation has focussed on formal analysis (Weyuker 1988; Fenton 1994; Hitz and Montazeri 1996; Al Dallal 2010), empirical evaluation (Kemerer 1995; Counsell et al 2005; Meyers and Binkley 2007; Beck and Diehl 2011) or user studies (Counsell et al 2006; Bouwers et al 2013; Simons et al 2015). While these approaches can establish formal metric properties and assess their applicability and usability, they cannot determine whether a metric measures the property it purports to measure. Our approach can be thought of as a form of cross validation that seeks to assess the degree to which metrics that ought to agree in theory, because they all purport to measure cohesion, really do agree in practice when applied to real-world software applications.
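The cross validation described here rests on a simple notion of agreement: after a refactoring is applied, two cohesion metrics agree if their values move in the same direction. The following is a minimal sketch, not the authors' implementation, of how such a pairwise check could be expressed in Java; the `Metric` interface and the `ClassModel` record are hypothetical placeholders, and it assumes the convention that higher values indicate greater cohesion.

```java
import java.util.List;
import java.util.Set;

// Hypothetical lightweight class model: each method is represented
// by the set of field names it uses.
record ClassModel(List<Set<String>> methodFieldUses) {}

// Hypothetical metric abstraction; higher value = more cohesive (assumed).
interface Metric {
    String name();
    double evaluate(ClassModel c);
}

final class AgreementCheck {
    enum Direction { UP, DOWN, SAME }

    // Direction in which a metric moves across one refactoring step.
    static Direction direction(Metric m, ClassModel before, ClassModel after) {
        double delta = m.evaluate(after) - m.evaluate(before);
        if (delta > 0) return Direction.UP;
        if (delta < 0) return Direction.DOWN;
        return Direction.SAME;
    }

    // Two metrics "agree" on a refactoring if they move in the same direction.
    static boolean agree(Metric a, Metric b, ClassModel before, ClassModel after) {
        return direction(a, before, after) == direction(b, before, after);
    }
}
```

Aggregating this pairwise check over many refactorings applied to real systems yields agreement/disagreement rates of the kind reported in the abstract.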

