Multiple Similarity-based Features Blending for Detecting Code Clones using Consensus-Driven Classification

Abdullah M Sheneamer

doi:10.1016/j.eswa.2021.115364

Abstract

Code clone detection helps to reduce the costs associated with software maintenance and bug prevention. Many machine learning methods have previously been proposed for detecting code clones. The majority of clone detectors are traditional in their approach; they can detect syntactic clones but are poor at detecting semantic clones. Researchers use machine learning to detect semantic clones by automatically scanning the data to learn latent semantic features. In this study, we introduce a new formal model of similarity that combines several similarity measures to capture both the syntactic and semantic distances between pairs of method blocks. The novelty of our study lies in using scores from different similarity measures as features for machine learning to detect code clones. We compute a number of similarity measures for each pair of method blocks and represent the resulting similarity scores as feature vectors. Using ensemble classification models, we perform extensive comparisons and evaluations of the effectiveness of our proposed idea. The results indicate that our approach is significantly better at detecting clone types than contemporary code clone detectors. We achieved a 99% success rate in detecting cloned code in terms of F-score, recall, and precision. Our approach achieves 98–100% accuracy in the majority of cases.
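The idea of using similarity scores as features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method evaluated in the study: the three measures (token-set Jaccard, token-count cosine, and a sequence-matching ratio), the regex tokenizer, the fixed 0.5 thresholds, and the simple majority vote that stands in for trained ensemble classifiers.

```python
import difflib
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN_RE.findall(code)


def jaccard(a, b):
    """Overlap of the two token sets (ignores order and frequency)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity of the token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def sequence_ratio(a, b):
    """Order-sensitive similarity of the two token sequences."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def similarity_features(code_a, code_b):
    """One feature vector per pair of method blocks: one score per measure."""
    ta, tb = tokenize(code_a), tokenize(code_b)
    return [jaccard(ta, tb), cosine(ta, tb), sequence_ratio(ta, tb)]


def consensus_is_clone(features, thresholds=(0.5, 0.5, 0.5)):
    """Assumed stand-in for an ensemble: each score votes, majority wins."""
    votes = sum(f >= t for f, t in zip(features, thresholds))
    return votes * 2 > len(features)


original = "int add(int a, int b) { return a + b; }"
renamed = "int sum(int x, int y) { return x + y; }"
unrelated = "print('hello')"

print(similarity_features(original, renamed), consensus_is_clone(similarity_features(original, renamed)))
print(similarity_features(original, unrelated), consensus_is_clone(similarity_features(original, unrelated)))
```

In a learned setting, the feature vectors produced by `similarity_features` would be passed to trained classifiers instead of fixed thresholds; the hand-set vote here only shows how several scores can be combined into one decision.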

Full Text