Source Code Similarity Research Articles

Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

Read full abstract

Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%.

Read full abstract

Source Code Similarity Research Articles

Related Topics

Articles published on Source Code Similarity

Automated Hints Generation for Investigating Source Code Plagiarism and Identifying The Culprits on In-Class Individual Programming Assessment

Similarity Code File Detection Model Based on Frequent Itemsets

An Approach to Source Code Plagiarism Detection Based on Abstract Implementation Structure Diagram

A comparison of code similarity analysers

A novel kernel to predict software defectiveness

Computation of Program Source Code Similarity by Composition of Parse Tree and Call Graph

A Token Oriented Measurement Method of Source Code Similarity

Multi-Agent based Sequence Algorithm for Detecting Plagiarism and Clones in Java Source Code using Abstract Syntax Tree

A Source Code Similarity System for Plagiarism Detection

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Source Code Similarity Research Articles

Related Topics

Articles published on Source Code Similarity

Automated Hints Generation for Investigating Source Code Plagiarism and Identifying The Culprits on In-Class Individual Programming Assessment

Similarity Code File Detection Model Based on Frequent Itemsets

An Approach to Source Code Plagiarism Detection Based on Abstract Implementation Structure Diagram

A comparison of code similarity analysers

A novel kernel to predict software defectiveness

Computation of Program Source Code Similarity by Composition of Parse Tree and Call Graph

A Token Oriented Measurement Method of Source Code Similarity

Multi-Agent based Sequence Algorithm for Detecting Plagiarism and Clones in Java Source Code using Abstract Syntax Tree

A Source Code Similarity System for Plagiarism Detection