Weighing lexical information for software clustering in the context of architecture recovery

Anna Corazza,Valerio Maggio,Sergio Di Martino,Giuseppe Scanniello

doi:10.1007/s10664-014-9347-3

Abstract

In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, and so their contribution should be differently weighed according to the software system under study. To this aim, we define a probabilistic model of the lexemes distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The weights of the zones are then exploited to compute similarities among source code classes, which are then grouped by a k-Medoid clustering algorithm. To assess the validity of our solution in the software architecture recovery field, we applied our approach to 19 software systems from different application domains. We observed that the use of our probabilistic model and the defined zones improves the quality of clustering results so that they are close to a theoretical upper bound we have proved.

Full Text