Semantic Clone Detection Using Machine Learning

Abdullah Sheneamer, Jugal Kalita

doi:10.1109/icmla.2016.0185

Abstract

If two fragments of source code are identical or similar to each other, they are called code clones. Code clones introduce difficulties in software maintenance and cause bug propagation. In this paper, we present a machine learning framework to automatically detect clones in software, which is able to detect Type-3 clones and the most complicated kind of clones, Type-4 clones. Previously used traditional features are often weak in detecting semantic clones. The novel aspects of our approach are the extraction of features from abstract syntax trees (AST) and program dependency graphs (PDG), the representation of a pair of code fragments as a vector, and the use of classification algorithms. The key benefit of this approach is that it can find both syntactic and semantic clones extremely well. Our evaluation indicates that using our new AST and PDG features is a viable methodology, since they improve clone detection on the IJaDataset 2.0.
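
As a rough illustration of representing a pair of code fragments as a vector, the sketch below counts a few AST node types per fragment and joins the two count vectors. It is not the paper's feature set or combination scheme: the node types in `NODE_TYPES`, the helper names `ast_features` and `pair_vector`, and the use of concatenation are all assumptions made for illustration, and Python's standard `ast` module stands in for a Java parser (IJaDataset contains Java code). PDG features are omitted.

```python
import ast
from collections import Counter
from typing import List

# Hypothetical feature set: occurrence counts of a few AST node types.
NODE_TYPES = ("For", "While", "If", "Assign", "AugAssign", "Call", "Return", "BinOp")


def ast_features(source: str) -> List[int]:
    """Count how often each selected AST node type occurs in a fragment."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    return [counts[name] for name in NODE_TYPES]


def pair_vector(frag_a: str, frag_b: str) -> List[int]:
    """Represent a pair of fragments as one vector (assumed here: concatenation)."""
    return ast_features(frag_a) + ast_features(frag_b)
```

A vector built this way, labeled as clone or non-clone, could then be passed to any standard classification algorithm for training and prediction.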

Full Text