Abstract

A code clone is a pair of code fragments in a code base that are semantically similar, whether or not they are also syntactically similar. Excessive code clones in a software system have a negative impact on its development and maintenance. In recent years, as deep learning has become an active research area within machine learning, researchers have applied deep learning techniques to code clone detection. They have proposed a series of detection techniques that use unstructured information (code as a sequence of tokens) and structured information (code as abstract syntax trees and control-flow graphs) to detect semantically similar but syntactically different clones, the clone type that is hardest to detect. However, although these methods have substantially improved the precision of semantic code clone detection, their false positive rate (FPR) remains very high, which prevents them from being applied effectively to real-world code bases. This paper proposes SCCD-GAN, an enhanced semantic code clone detection model that is based on a graph representation of programs and uses a Graph Attention Network to measure the similarity of code pairs, achieving a lower FPR than existing methods. We build the graph representation of the code by adding control-flow and data-flow information to the original abstract syntax tree, and we equip the model with an attention mechanism that focuses on the code parts and features that contribute most to detection precision. We implemented our method and evaluated it on two benchmark datasets in the field of code clone detection, BigCloneBench2 and Google Code Jam. SCCD-GAN performed better than existing state-of-the-art methods in terms of precision and false positive rate.
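
The graph construction described above (an abstract syntax tree extended with control-flow and data-flow edges) can be illustrated with a minimal sketch. This is not the SCCD-GAN implementation: it uses Python's standard `ast` module on Python source purely for illustration, the function name `build_program_graph` and the edge kinds `"ast"`, `"cfg"` and `"data"` are hypothetical, the control-flow edges are approximated by linking consecutive statements, and the data-flow edges only handle straight-line top-level code.

```python
import ast

def build_program_graph(source):
    """Toy program graph: AST edges plus naive control-flow and data-flow edges.

    Illustrative only; names and edge kinds are assumptions, not the paper's.
    """
    tree = ast.parse(source)
    # Keep module, statement and expression nodes; skip context/operator helpers.
    kept = [n for n in ast.walk(tree)
            if isinstance(n, (ast.mod, ast.stmt, ast.expr))]
    index = {id(n): i for i, n in enumerate(kept)}
    labels = [type(n).__name__ for n in kept]
    edges = []  # (source index, target index, edge kind)

    # 1. AST edges: parent -> child.
    for parent in kept:
        for child in ast.iter_child_nodes(parent):
            if id(child) in index:
                edges.append((index[id(parent)], index[id(child)], "ast"))

    # 2. Approximate control flow: each statement -> the next one in its block.
    for node in kept:
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list):
                for cur, nxt in zip(block, block[1:]):
                    edges.append((index[id(cur)], index[id(nxt)], "cfg"))

    # 3. Naive data flow over top-level statements: last write -> later read.
    last_def = {}
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        for n in names:  # reads happen before the statement's own writes
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((last_def[n.id], index[id(n)], "data"))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = index[id(n)]

    return labels, edges

labels, edges = build_program_graph("a = 1\nb = a + 2\na = a * b\n")
```

The resulting `labels` and `edges` lists are the kind of typed graph that a graph neural network, such as a graph attention layer, could take as input after the node labels are embedded. A realistic pipeline would need a proper control-flow and data-flow analysis for the target language.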
