A comparison of code similarity analysers

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark

doi:10.1007/s10664-017-9564-7

Open Access

https://doi.org/10.1007/s10664-017-9564-7

Journal: Empirical Software Engineering
Publication Date: Oct 25, 2017
Citations: 93
License type: open-access

Affiliation: University College London

Abstract

Copying and pasting of source code is a common activity in software engineering. Often the code is not copied verbatim but is modified for various purposes, e.g. refactoring, bug fixing, or even software plagiarism. These code modifications can affect the performance of code similarity analysers, including code clone and plagiarism detectors, to some degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranking-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique: its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. When optimal configurations derived from one data set are applied directly to another, the tools perform poorly on the new data set.
The code similarity analysers are thoroughly evaluated, based not only on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.
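
To make the notions of a general textual similarity measure, a pair-based error measure, and a tool configuration more concrete, the following is a minimal sketch and not the procedure used in the study. It assumes a crude regular-expression tokeniser, uses Python's standard difflib.SequenceMatcher as a stand-in textual similarity measure, and treats a single similarity threshold as the only "configuration". The three Java snippets and all function names are hypothetical.

```python
import difflib
import re

def tokens(code):
    # Crude lexer (an assumption): identifiers, integer literals, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def similarity(a, b):
    # Generic textual similarity in [0, 1] computed over token sequences.
    matcher = difflib.SequenceMatcher(None, tokens(a), tokens(b), autojunk=False)
    return matcher.ratio()

def false_classifications(pairs, threshold):
    # Pair-based error count: a pair is reported as similar when its score
    # reaches the threshold; returns (false positives, false negatives).
    fp = fn = 0
    for a, b, is_similar in pairs:
        predicted = similarity(a, b) >= threshold
        if predicted and not is_similar:
            fp += 1
        elif not predicted and is_similar:
            fn += 1
    return fp, fn

# Hypothetical data: an original method, a variant with every variable
# renamed (a pervasive change), and an unrelated method.
ORIGINAL = "int sum(int[] a) { int s = 0; for (int i = 0; i < a.length; i++) { s += a[i]; } return s; }"
RENAMED = "int total(int[] xs) { int t = 0; for (int k = 0; k < xs.length; k++) { t += xs[k]; } return t; }"
UNRELATED = 'void greet(String name) { System.out.println("Hello " + name); }'

PAIRS = [(ORIGINAL, RENAMED, True), (ORIGINAL, UNRELATED, False)]
```

With these snippets, false_classifications(PAIRS, 0.5) reports no errors, whereas a threshold of 0.9 misses the renamed pair and a threshold of 0.0 accepts the unrelated pair. This small example only hints at why a threshold tuned on one data set may not transfer to another.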

Highlights

Assessing source code similarity is a fundamental activity in software engineering and it has many applications.
We study the tools’ performance on both local and pervasive code modifications usually found in software engineering activities such as code cloning, software plagiarism, and code refactoring.
The results show that, overall, highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.

Summary

Introduction

Assessing source code similarity is a fundamental activity in software engineering and it has many applications. Pervasive modifications are code changes that affect the code globally across the whole file, with multiple changes applied one after another. These are code transformations that are mainly found in the course of software plagiarism, when one wants to conceal copied code by changing its appearance and so avoid detection (Daniela et al. 2012). They also represent code clones that are repeatedly modified over time during software evolution (Pate et al. 2013), and source code before and after refactoring activities (Fowler 2013). Our definition of pervasive modifications excludes strong obfuscation (Collberg et al. 1997), which aims to protect code from reverse engineering by making it difficult or impossible to understand.
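
The contrast between a pervasive and a local modification can be illustrated with a small sketch. It is a toy under stated assumptions, not one of the obfuscators used in the study: the two transformations (comment stripping and regex-based identifier renaming), the sample Java method, and all names are hypothetical, and the renaming is deliberately crude (it would also rewrite matching words inside string literals).

```python
import re

def strip_line_comments(code):
    # Remove // comments up to the end of each line.
    return re.sub(r"//[^\n]*", "", code)

def rename_identifiers(code, mapping):
    # Replace whole-word identifiers everywhere in the file.
    def swap(match):
        word = match.group(0)
        return mapping.get(word, word)
    return re.sub(r"[A-Za-z_]\w*", swap, code)

def changed_lines(before, after):
    # Count line positions whose text differs (assumes equal line counts).
    return sum(1 for x, y in zip(before.splitlines(), after.splitlines()) if x != y)

SOURCE = """int sum(int[] a) {
    int s = 0; // accumulator
    for (int i = 0; i < a.length; i++) {
        s += a[i];
    }
    return s;
}"""

# Pervasive: several transformations applied one after another, touching the whole file.
PERVASIVE = rename_identifiers(strip_line_comments(SOURCE), {"a": "xs", "s": "t", "i": "k"})

# Local: one statement rewritten inside a single block.
LOCAL = SOURCE.replace("s += a[i];", "s = s + a[i];")
```

Here changed_lines(SOURCE, PERVASIVE) is 5 of the 7 lines, while changed_lines(SOURCE, LOCAL) is 1, even though both variants compute the same result as the original method.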

Objectives

Results

Conclusion