Code cloning (CC) is the practice of copying a code fragment, adapting it, and reusing it in another part of a software project. These clones increase the running overhead of the software. As a result, Code Clone Detection (CCD) has become an active research area in software development. Detecting Large-Variance Code Clones (LV-CCs) is very difficult when the source code contains a large number of lines of code (LOC). Distance metrics have been used for LV-CC detection by calculating the distance between the training and testing feature sets of the source code. However, selecting a threshold for deciding whether a pair is a clone is a challenging issue in distance-based LV-CC detection. To solve this, this paper develops a Collaborative CCD using Deep Learning (CCCD-DL) that utilises lexical, syntactic, semantic and structural features to identify all types of clones together. Lexical features are extracted from the Clone Pairs (CPs) identified by LV-Mapper. Syntactic and semantic features are obtained from the Abstract Syntax Tree (AST) and the Control Flow Graph (CFG). Structural features are extracted using code size metrics (CZMs) and object-oriented metrics (OOMs). All features are combined and fed into the input layer of a deep neural network (DNN). In the multi-classification stage, the hidden layer passes the inputs to its neurons using a linear transformation followed by a squashing non-linearity. This process yields a complex, non-linear model with a weight matrix for fitting the training data, which completes the feed-forward step. The model then uses back-propagation in the following step to adjust the weight matrix based on the training set. Finally, a softmax layer converts the clone detection task into a classification process. 
Experimental results show that the proposed method addresses the problems of distance-based detection more quickly and effectively than traditional CCD methods.
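The feed-forward, back-propagation and softmax steps described above can be illustrated with a minimal sketch. It is not the CCCD-DL implementation: the single hidden layer, the tanh activation, the learning rate, the three class labels, and the synthetic four-value feature vectors (one hypothetical score per feature family) are all assumptions made for illustration only.

```python
import math
import random


def softmax(z):
    """Turn raw output scores into class probabilities."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class TinyCloneClassifier:
    """Hypothetical stand-in for the DNN: one tanh hidden layer + softmax."""

    def __init__(self, n_in, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_classes)]
        self.b2 = [0.0] * n_classes

    def forward(self, x):
        # Feed-forward: linear transformation followed by a squashing tanh.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        z = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.W2, self.b2)]
        return h, softmax(z)

    def train_step(self, x, y, lr=0.1):
        h, p = self.forward(x)
        # Gradient of the cross-entropy loss w.r.t. the output scores.
        dz = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
        # Back-propagate through W2 and the tanh derivative (1 - h^2).
        dh = [sum(dz[k] * self.W2[k][j] for k in range(len(dz)))
              * (1.0 - h[j] ** 2) for j in range(len(h))]
        for k, row in enumerate(self.W2):
            for j in range(len(row)):
                row[j] -= lr * dz[k] * h[j]
            self.b2[k] -= lr * dz[k]
        for j, row in enumerate(self.W1):
            for i in range(len(row)):
                row[i] -= lr * dh[j] * x[i]
            self.b1[j] -= lr * dh[j]
        return -math.log(p[y])  # loss measured before this update


def make_toy_data(n_per_class=30, seed=1):
    """Synthetic vectors: [lexical, syntactic, semantic, structural] scores."""
    rng = random.Random(seed)
    # Assumed labels: 0 = not a clone, 1 = near-identical, 2 = large-variance.
    centres = {0: [0.1, 0.1, 0.1, 0.1],
               1: [0.9, 0.9, 0.9, 0.9],
               2: [0.3, 0.5, 0.8, 0.6]}
    data = [([v + rng.gauss(0, 0.05) for v in c], label)
            for label, c in centres.items() for _ in range(n_per_class)]
    rng.shuffle(data)
    return data


def predict(model, x):
    probs = model.forward(x)[1]
    return max(range(len(probs)), key=lambda k: probs[k])


data = make_toy_data()
model = TinyCloneClassifier(n_in=4, n_hidden=8, n_classes=3)
losses = [sum(model.train_step(x, y) for x, y in data) / len(data)
          for _ in range(100)]  # 100 passes over the toy training set
accuracy = sum(predict(model, x) == y for x, y in data) / len(data)
print(f"first-epoch loss {losses[0]:.3f}, last-epoch loss {losses[-1]:.3f}")
print(f"training accuracy {accuracy:.2f}")
```

Because the softmax output assigns each pair a probability per class, the predicted label is simply the most probable class, so no distance threshold has to be chosen by hand. In this sketch that is the only point being demonstrated; real feature extraction from LV-Mapper, ASTs, CFGs and metrics is outside its scope.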