A Framework for Studying Clones In Large Software Systems

Zhen Ming Jiang, Ahmed E. Hassan

doi:10.1109/scam.2007.4362914

Abstract

Clones are code segments that have been created by copying and pasting other code segments. Clones occur often in large software systems: it is reported that 5% to 50% of the source code of a large software system is cloned. A major challenge when studying code cloning in large software systems is handling the large number of clone candidates produced by leading-edge clone detection tools. For example, the CCFinder clone detection tool produces over 7 million pairs of clone candidates for the Linux kernel (which consists of over 4 MLOC). Moreover, the output of clone detection tools grows rapidly as a software system evolves. Researchers and developers need tools to help them study this large volume of clone data in order to better understand the clone phenomena in large systems. In this paper, we propose a data mining framework to help researchers cope with the large amount of data produced by clone detection tools. We propose techniques to reduce, abstract, and highlight the most interesting data produced by clone detection tools. Our framework also introduces a visualization tool that allows users to query and explore clone data at various abstraction levels. We demonstrate our framework with a case study of the clone phenomena in the Linux kernel.

Full Text