Big data clone detection using classical detectors: an exploratory study

Jeffrey Svajlenko, Iman Keivanloo, Chanchal K Roy

doi:10.1002/smr.1662

Abstract

Big data analysis is an emerging research topic in various domains, and clone detection is no exception. The goal is to create big data inter‐project clone corpora across open‐source or corporate‐source code repositories. Such corpora can be used to study developer behavior and to reduce engineering costs by extracting globally duplicated efforts into new APIs and as a basis for code completion and API usage support. However, building scalable clone detection tools is challenging. It is often impractical to use existing state‐of‐the‐art tools to analyze big data because the memory and execution time required exceed the average user's resources. Some tools have inherent limitations in their data structures and algorithms that prevent the analysis of big data even when extraordinary resources are available. These limitations are impossible to overcome if the source code of the tool is unavailable or if the user lacks the time or expertise to modify the tool without harming its performance or accuracy. In this research, we have investigated the use of our shuffling framework for scaling classical clone detection tools to big data. The framework achieves scalability on commodity hardware by partitioning the input dataset into subsets manageable by the tool and computing resources. A non‐deterministic process is used to randomly ‘shuffle’ the contents of the dataset into a series of subsets. The tool is executed for each subset, and its output for each is merged into a single report. This approach does not require modification to the subject tools, allowing their individual strengths and precision to be captured at an acceptable loss of recall. In our study, we explored the performance and applicability of the framework for the big data dataset, IJaDataset 2.0, which consists of 356 million lines of code from 25,000 open‐source Java projects. We began with a computationally inexpensive version of our framework based on pure random shuffling.
This version was successful at scaling the tools to IJaDataset but required many subsets to achieve a desirable recall. Using our findings, we incrementally improved the framework to achieve a satisfactory recall using fewer resources. We investigated the use of efficient file tracking and file‐similarity heuristics to bias the shuffling algorithm toward subsets of the dataset that contain undetected clone pairs. These changes were successful in improving the recall performance of the framework. Our study shows that the framework is able to achieve up to 90–95% of a tool's native recall using standard hardware. Copyright © 2014 John Wiley & Sons, Ltd.
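
The pure random shuffling variant described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `shuffle_and_detect` and `make_toy_detector`, the `subset_size` and `rounds` parameters, and the choice to partition a freshly shuffled copy of the dataset in each round are all assumptions made for illustration. The toy detector, which reports files with identical contents, stands in for an unmodified classical clone detection tool.

```python
import random
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

ClonePair = Tuple[str, str]
Detector = Callable[[List[str]], Iterable[ClonePair]]


def shuffle_and_detect(
    files: List[str],
    detector: Detector,
    subset_size: int,
    rounds: int,
    seed: Optional[int] = None,
) -> Set[ClonePair]:
    """Run an unmodified detector on random subsets and merge its reports.

    Assumption: each round shuffles the whole file list and cuts it into
    consecutive subsets of at most `subset_size` files.
    """
    rng = random.Random(seed)
    merged: Set[ClonePair] = set()
    for _ in range(rounds):
        shuffled = list(files)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), subset_size):
            subset = shuffled[start:start + subset_size]
            for a, b in detector(subset):
                # Normalize pair order so duplicates across rounds merge.
                merged.add((a, b) if a <= b else (b, a))
    return merged


def make_toy_detector(contents: Dict[str, str]) -> Detector:
    """Hypothetical stand-in for a real tool: flags identical file contents."""
    def detect(subset: List[str]) -> List[ClonePair]:
        return [
            (a, b)
            for a, b in combinations(subset, 2)
            if contents[a] == contents[b]
        ]
    return detect


if __name__ == "__main__":
    contents = {"A.java": "x", "B.java": "x", "C.java": "y", "D.java": "x"}
    detector = make_toy_detector(contents)
    found = shuffle_and_detect(list(contents), detector, subset_size=2, rounds=5, seed=1)
    print(sorted(found))
```

A clone pair can be reported only when both of its files land in the same subset, so in this sketch recall grows with the number of rounds while precision stays that of the underlying tool. This is consistent with the abstract's observation that pure random shuffling required many subsets to reach a desirable recall.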
