Academic Source Code Plagiarism Detection by Measuring Program Behavioral Similarity

Hayden Cheers, Yuqing Lin, Shamus P Smith

doi:10.1109/access.2021.3069367

Open Access

https://doi.org/10.1109/access.2021.3069367

Journal: IEEE Access
Publication Date: Jan 1, 2021
Citations: 69
License type: CC BY 4.0

Affiliation: University of Newcastle Australia

Abstract

Source code plagiarism is a long-standing issue in tertiary computer science education. Many source code plagiarism detection tools have been proposed to aid in its detection. However, existing detection tools are not robust to pervasive plagiarism-hiding transformations, and as a result can be inaccurate in the detection of plagiarised source code. This article presents BPlag, a behavioural approach to source code plagiarism detection. BPlag is designed to be both robust to pervasive plagiarism-hiding transformations and accurate in the detection of plagiarised source code. Greater robustness and accuracy are afforded by analysing the behaviour of a program, as behaviour is perceived to be the aspect of a program least susceptible to plagiarism-hiding transformations. BPlag applies symbolic execution to analyse execution behaviour and represent a program in a novel graph-based format. Plagiarism is then detected by comparing these graphs and evaluating similarity scores. BPlag is evaluated for robustness, accuracy and efficiency against 5 commonly used source code plagiarism detection tools. The evaluation shows that BPlag is more robust to plagiarism-hiding transformations and more accurate in the detection of plagiarised source code, but is less efficient than the compared tools.

Highlights

Plagiarism is a long-standing issue in academic institutions
If two Program Interaction Dependency Graphs (PIDGs) [42] are of vastly different sizes, but one is a subset of the other, it can result in a false positive through the evaluation of an unexpectedly high similarity score
Table 8 lists the error counts for each Source Code Plagiarism Detection Tool (SCPDT) in detecting the simulated plagiarism
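
The subset effect noted in the PIDG highlight can be illustrated with a small sketch. It does not reproduce BPlag's graph comparison or scoring, which are not given here. It assumes, purely for illustration, that a graph is a set of directed edges and that similarity is a containment-style score (shared edges divided by the size of the smaller graph). The function names and the toy graphs are hypothetical.

```python
# Illustrative only: NOT BPlag's scoring. Graphs are modelled as sets of
# directed edges, and both measures below are assumptions for this sketch.

def containment_similarity(g1: set, g2: set) -> float:
    """Shared edges relative to the smaller graph (hypothetical measure)."""
    smaller = min(len(g1), len(g2))
    if smaller == 0:
        return 0.0
    return len(g1 & g2) / smaller


def jaccard_similarity(g1: set, g2: set) -> float:
    """Shared edges relative to all edges in either graph."""
    union = g1 | g2
    if not union:
        return 0.0
    return len(g1 & g2) / len(union)


# A tiny graph that happens to be fully contained in a much larger one.
small_graph = {("a", "b"), ("b", "c")}
large_graph = small_graph | {(f"n{i}", f"n{i + 1}") for i in range(8)}

print(containment_similarity(small_graph, large_graph))  # 1.0
print(jaccard_similarity(small_graph, large_graph))      # 0.2
```

Under these assumptions the containment-style score reports a perfect match even though the larger graph holds five times as many edges, while a size-aware measure such as Jaccard similarity does not. This is one way a size mismatch could produce a misleadingly high score.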

Summary

Introduction

Studies have indicated that between 50% and 79% of undergraduate students will plagiarise at least once during their academic careers [1]–[4]. With such a high rate, it is highly likely that an academic will have to assess a suspected case of plagiarism. Structural approaches measure similarity by identifying common structures in source code. In its most basic form, structural similarity can be measured over textual strings, by applying techniques such as string edit distance or string alignment to the source code [7], [15]–[17]. Other approaches measure the structural similarity of parse trees or abstract syntax trees, representing the source code within the grammar of a programming language [22]–[24].
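
As a minimal sketch of the string-based structural comparison described above, the following computes a Levenshtein edit distance between two source strings and normalises it into a similarity score. The normalisation by the longer string's length and the function names are assumptions made for illustration; the cited tools may tokenise, normalise, or score differently.

```python
# A sketch of edit-distance similarity, not the method of any cited tool.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalising by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


original = "int total = a + b;"
renamed = "int sum = x + y;"
print(string_similarity(original, original))  # 1.0
print(string_similarity(original, renamed))
```

Renaming identifiers alone lowers the score in this sketch, which shows why purely textual measures are sensitive to simple plagiarism-hiding transformations.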

Methods

Results

Conclusion