Abstract

Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection. In computer programming teaching, students may convert source code written in one programming language into another language before submitting it as their own assignment. Existing similarity measures for source code written in the same language are not applicable to cross-language code similarity detection because of the syntactic differences among programming languages. Meanwhile, existing cross-language source code similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing equivalent control structures and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by comparing the corresponding SCFCs. More specifically, we first introduce the SCFC model as a uniform flowchart representation of source code written in different languages. SCFC is language-independent and can therefore be used as the intermediate structure for source code similarity detection. Transformation techniques are given to transform source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm, based on the shortest path graph kernel, to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between their SCFCs. Experimental results show that, compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection.
Furthermore, CLCSD can not only handle the common source code obfuscation techniques used by students in computer programming teaching but also obtain nearly 90% accuracy in dealing with some complex obfuscation techniques.
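
To make the graph-kernel step concrete, the following is a minimal sketch of a generic shortest path graph kernel applied to small labeled flowcharts. It is not the SCFC-SPGK algorithm itself, whose details are not given in this abstract. The node labels (`start`, `process`, `decision`, `end`), the delta kernel on label pairs and path lengths, the cosine-style normalization, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import inf, sqrt


def shortest_path_features(labels, edges):
    """Count (source label, target label, shortest-path length) triples.

    labels: dict mapping node id -> label (hypothetical flowchart node types)
    edges:  list of directed (u, v) pairs
    """
    nodes = list(labels)
    dist = {(u, v): (0 if u == v else inf) for u, v in product(nodes, repeat=2)}
    for u, v in edges:
        if u != v:
            dist[(u, v)] = 1
    # Floyd-Warshall: k is the outermost loop variable of product().
    for k, i, j in product(nodes, repeat=3):
        if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
            dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
    return Counter(
        (labels[u], labels[v], d)
        for (u, v), d in dist.items()
        if u != v and d < inf
    )


def sp_kernel(f1, f2):
    """Delta kernel: two paths match when labels and length are all equal."""
    return sum(count * f2[key] for key, count in f1.items())


def similarity(g1, g2):
    """Normalized kernel value in [0, 1] (normalization is an assumption)."""
    f1 = shortest_path_features(*g1)
    f2 = shortest_path_features(*g2)
    denom = sqrt(sp_kernel(f1, f1) * sp_kernel(f2, f2))
    return sp_kernel(f1, f2) / denom if denom else 0.0


# A simple loop flowchart: start -> init -> test <-> body, test -> end.
LOOP_A = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end"},
    [(0, 1), (1, 2), (2, 3), (3, 2), (2, 4)],
)
# The same shape with different node ids, as another language might yield.
LOOP_B = (
    {"s": "start", "i": "process", "t": "decision", "b": "process", "e": "end"},
    [("s", "i"), ("i", "t"), ("t", "b"), ("b", "t"), ("t", "e")],
)
# The loop with one redundant statement added to its body.
LOOP_C = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end", 5: "process"},
    [(0, 1), (1, 2), (2, 3), (3, 5), (5, 2), (2, 4)],
)

if __name__ == "__main__":
    print(similarity(LOOP_A, LOOP_B))
    print(similarity(LOOP_A, LOOP_C))
```

In this sketch, `LOOP_A` and `LOOP_B` are isomorphic and therefore receive the maximum score. `LOOP_C` shares most but not all shortest-path features with `LOOP_A`, so the redundant statement lowers the score without driving it to zero.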

Highlights

  • Since the 1970s, the source code similarity detection technique has attracted the attention of global researchers, and it has been widely used in the source code plagiarism detection in computer programming teaching and code intellectual property protection [1]

  • For source code written in different programming languages, no matter what kind of code transformation or obfuscation technique is adopted [6], the core processes are highly similar if the programming ideas are the same. This circumstance is close to type IV clones [7]. Therefore, to address cross-language source code similarity detection in the teaching of computer programming, we propose an approach named cross-language code similarity detection (CLCSD), based on code flowcharts.

  • First of all, for two source code fragments written in different programming languages, the flowcharts produced directly by current code conversion tools may differ even though the fragments implement the same process. This is because the flowcharts obtained by existing code flowchart conversion approaches and tools are strongly correlated with the syntax of the programming language. Therefore, we propose a standardized code flowchart (SCFC) model based on the code flowchart (CFC) and the program dependency graph (PDG) [8].


Summary

Introduction

Since the 1970s, the source code similarity detection technique has attracted the attention of researchers worldwide, and it has been widely used for source code plagiarism detection in computer programming teaching and for code intellectual property protection [1]. Therefore, aiming at cross-language source code similarity detection in the teaching of computer programming, we propose an approach named CLCSD (cross-language code similarity detection) based on code flowcharts. In this approach, source code written in different programming languages is transformed into corresponding flowcharts, and the similarity of the code is obtained by measuring the similarity between the flowcharts. SCFC standardizes the code flowcharts of different languages. It is suitable for dealing with the most common code obfuscation techniques in programming assignments.
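
The idea of standardizing flowcharts can be illustrated with a small sketch that turns Python source into a simplified flowchart. It is not the SCFC transformation of the paper, whose rules are not stated in this summary, and it covers a single language, whereas the approach described above would need one front end per language. The node labels, the expansion of a `for` loop into an initialization node, a test node, and an advance node, and the function name `to_flowchart` are all assumptions.

```python
import ast


def to_flowchart(source):
    """Return (labels, edges) for a simplified flowchart of Python source.

    Assumption: break, continue, return, and nested definitions are treated
    as plain process nodes, so only sequence, if, for, and while are modeled.
    """
    labels, edges = {}, []

    def new_node(label):
        node_id = len(labels)
        labels[node_id] = label
        return node_id

    def link(sources, target):
        edges.extend((s, target) for s in sources)

    def walk(stmts, incoming):
        """Add nodes for stmts; return the nodes that flow to what follows."""
        for stmt in stmts:
            if isinstance(stmt, ast.For):
                # Assumed expansion: init -> test -> body -> advance -> test.
                init = new_node("process")
                link(incoming, init)
                test = new_node("decision")
                link([init], test)
                body_exits = walk(stmt.body, [test])
                step = new_node("process")
                link(body_exits, step)
                link([step], test)
                incoming = [test]
            elif isinstance(stmt, ast.While):
                test = new_node("decision")
                link(incoming, test)
                link(walk(stmt.body, [test]), test)
                incoming = [test]
            elif isinstance(stmt, ast.If):
                test = new_node("decision")
                link(incoming, test)
                then_exits = walk(stmt.body, [test])
                else_exits = walk(stmt.orelse, [test])
                incoming = then_exits + else_exits
            else:
                node = new_node("process")
                link(incoming, node)
                incoming = [node]
        return incoming

    start = new_node("start")
    exits = walk(ast.parse(source).body, [start])
    link(exits, new_node("end"))
    return labels, edges


FOR_VERSION = """
total = 0
for i in range(10):
    total += i
"""

WHILE_VERSION = """
total = 0
i = 0
while i < 10:
    total += i
    i += 1
"""

if __name__ == "__main__":
    print(to_flowchart(FOR_VERSION) == to_flowchart(WHILE_VERSION))
```

Under these assumptions, the `for` version and its hand-written `while` equivalent produce the same node labels and edges, which is the kind of invariance to equivalent control structures that a standardized flowchart is meant to provide.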

Related Work
Framework of CLCSD
Structure types
Similarity Measure of SCFCs
Experiment and Evaluation
Findings
Conclusion and Future Work