Abstract

Source code reuse detection is increasingly important as a plagiarism prevention practice in academic research. For a large collection of source code files, manual detection of code reuse is impractical, so there is a vital need for automatic, highly accurate tools. This paper introduces a structure-based approach for recognizing source code (SOCO) reuse in reference programs. The proposed model consists of three main phases: preprocessing, sequence generation, and decision-making based on estimated similarities. In the first phase, the important instructions in each code file are identified, and the source code is converted to a string of specific tokens. In the second phase, a sequence alignment process is carried out, and the tree representation of the source code is constructed. In the third phase, the similarity values among the code files are estimated using three different strategies based on both lexical and structural comparison of the source code, and the system then decides on each pair of files. The method is evaluated on the SOCO-2014 corpus. A comparison of our results with those of the contest participants indicates that the proposed method’s performance is acceptable and promising.
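
The three phases can be illustrated with a minimal Python sketch. It is not the paper's implementation: the token map, the use of a longest-common-subsequence score as a stand-in for the sequence alignment step, and the 0.8 decision threshold are all assumptions chosen for illustration, and the tree representation is omitted.

```python
import re

# Hypothetical token map (an assumption, not taken from the paper):
# each "important" keyword is mapped to a single-letter token.
TOKEN_MAP = {"if": "I", "else": "E", "for": "L", "while": "L", "return": "R"}


def to_token_string(source: str) -> str:
    """Preprocessing: keep only mapped keywords and emit a token string."""
    words = re.findall(r"[A-Za-z_]\w*", source)
    return "".join(TOKEN_MAP[w] for w in words if w in TOKEN_MAP)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two token strings.

    Used here as a simple stand-in for a sequence alignment score.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 0.0 when both strings are empty."""
    if not a and not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def is_reuse(src_a: str, src_b: str, threshold: float = 0.8) -> bool:
    """Decision: flag the pair when similarity reaches the (assumed) threshold."""
    return similarity(to_token_string(src_a), to_token_string(src_b)) >= threshold
```

Because both `for` and `while` map to the same token in this sketch, two files that differ only in loop style and identifier names produce the same token string and are flagged as a reuse pair.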
