Semantic Clone Detection: Can Source Code Comments Help?

Akash Ghosh, Sandeep Kaur Kuttal

doi:10.1109/vlhcc.2018.8506550

Abstract

Programmers reuse code to increase their productivity, which leads to large fragments of duplicate or near-duplicate code in the code base. Current techniques for detecting semantic clones rely on Program Dependency Graphs (PDGs), which are expensive and resource-intensive. PDG-based and other clone detection techniques analyze only the code and ignore comments entirely because of the ambiguity of the English language, yet comments carry important domain knowledge for program comprehension. We empirically evaluated the accuracy of detecting clones with both code and comments on a JHotDraw package. Detecting clones from comments using Latent Dirichlet Allocation (LDA) gave 84% precision and 94% recall, whereas detecting clones from code using a PDG with GRAPLE gave 55% precision and 29% recall. These results indicate that comments can be used to find semantic clones. We recommend using comments with LDA to find clones at the file level and code with PDGs to find clones at the function level. These findings call for a reexamination of the assumptions underlying semantic clone detection techniques.
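The precision and recall figures above compare the clone pairs a detector reports against a reference set of known clones. The following is a minimal sketch of that computation, not the evaluation code used in the study. The function names, the file names, and the toy data are all hypothetical, and the sketch assumes clone pairs are unordered, so that (A, B) and (B, A) count as the same pair.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same unordered clone pair."""
    return {frozenset(pair) for pair in pairs}


def precision_recall(reported, oracle):
    """Return (precision, recall) of reported clone pairs against an oracle.

    precision = correctly reported pairs / all reported pairs
    recall    = correctly reported pairs / all pairs in the oracle
    Both default to 0.0 when their denominator is empty.
    """
    reported_set = normalize(reported)
    oracle_set = normalize(oracle)
    true_positives = len(reported_set & oracle_set)
    precision = true_positives / len(reported_set) if reported_set else 0.0
    recall = true_positives / len(oracle_set) if oracle_set else 0.0
    return precision, recall


# Hypothetical oracle of known file-level clone pairs (illustrative only).
oracle = [
    ("A.java", "B.java"),
    ("C.java", "D.java"),
    ("E.java", "F.java"),
    ("G.java", "H.java"),
]

# Hypothetical detector output: two correct pairs and one false positive.
reported = [
    ("B.java", "A.java"),  # same pair as (A, B), listed in reverse order
    ("C.java", "D.java"),
    ("A.java", "C.java"),  # not in the oracle
]

precision, recall = precision_recall(reported, oracle)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.67 recall=0.50"
```

In this toy run, two of the three reported pairs are correct (precision 2/3) and two of the four known pairs are found (recall 2/4). A detector with high precision but low recall reports few false clones but misses many real ones. The reverse profile finds most clones at the cost of more false alarms.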

Full Text