Abstract

Source code plagiarism is an emerging issue in computer science education, and a number of techniques have been proposed to address it. Comparing these techniques can be challenging, however, since each is typically evaluated on its own private dataset(s). This paper contributes a public dataset for comparing such techniques. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. The dataset consists of 467 source code files, covering seven introductory programming assessment tasks. Uniquely, both the intention to plagiarise and advanced plagiarism attacks were considered in its construction. The dataset's characteristics were observed by comparing three IR-based detection techniques; most IR-based techniques proved less effective than a baseline technique relying on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
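As a rough illustration of the Greedy-String-Tiling baseline mentioned above, the sketch below implements plain Greedy String Tiling over token sequences. This is an assumption-laden simplification: the Running-Karp-Rabin variant used by the baseline additionally applies Karp-Rabin hashing to speed up the match search, which this version omits, and the tokenisation step is left to the caller.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Plain Greedy String Tiling (no Karp-Rabin speed-up).

    a, b: token sequences (e.g. lexed source code).
    Returns a list of tiles (i, j, length): maximal non-overlapping
    common substrings of at least min_match tokens.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match - 1
        matches = []
        # Scan every unmarked starting pair for the longest common run.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    max_len = k
                    matches = [(i, j, k)]
                elif k == max_len and k >= min_match:
                    matches.append((i, j, k))
        if max_len < min_match:
            break
        # Mark the longest matches of this round as tiles.
        for i, j, k in matches:
            if any(marked_a[i + t] or marked_b[j + t] for t in range(k)):
                continue  # overlaps a tile already taken this round
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def gst_similarity(a, b, min_match=3):
    # Classic tile-coverage similarity: fraction of tokens in tiles.
    coverage = sum(length for _, _, length in
                   greedy_string_tiling(a, b, min_match))
    return 2 * coverage / (len(a) + len(b)) if (a or b) else 0.0
```

Because tiles must not overlap, simple reordering or renaming attacks on the token stream lower the coverage score gradually rather than defeating the comparison outright, which is why GST-style baselines remain strong in this setting.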

Highlights

  • Source code plagiarism is an act of reusing other people’s code with no acknowledgement of the original work (Cosma & Joy, 2008)

  • This paper presents a dataset for source code plagiarism detection

  • This paper proposes a dataset for evaluating source code plagiarism detection techniques from an Information Retrieval (IR) perspective, where the plagiarised cases are formed with the intention of plagiarising and contain advanced plagiarism attacks in addition to simple ones


Introduction

Source code plagiarism is an act of reusing other people’s code with no (or improper) acknowledgement of the original work (Cosma & Joy, 2008). It is an emerging issue in Computer Science (CS) education (Simon et al., 2018), as grades may fail to reflect students’ real capabilities. Many detection techniques have been proposed in response. Some focus more on effectiveness factors (such as accuracy and the capability to detect complex modifications), while others focus on efficiency (such as processing time). These techniques can be challenging to compare with one another, since most are evaluated on their own datasets, and those datasets are not publicly accessible. Such datasets, stored on the corresponding author’s local repository, may go missing or become corrupted due to technical problems. Another important issue with existing datasets is that some may not represent real plagiarism cases. In this work, the dataset’s characteristics were observed by comparing detection techniques derived from three popular IR retrieval models: the Vector Space Model, Latent Semantic Indexing, and the Language Model (Croft, Metzler, & Strohman, 2010).
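To make the Vector Space Model comparison concrete, the sketch below scores two token sequences by cosine similarity over raw term-frequency vectors. This is a minimal illustration only: the compared techniques in the paper may use different weighting schemes (e.g. tf-idf) and tokenisation, which are assumptions left out here.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of term-frequency vectors over code tokens."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in set(va) | set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that a bag-of-tokens model like this is order-insensitive, which hints at why IR-based techniques can be much faster than tiling-based baselines yet less effective against structural plagiarism attacks.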

A Review of Automated Source Code Plagiarism Detection
The Dataset
Methodology
Result
Methodology
Baseline Analysis
LM-Oriented Analysis
Summary
Findings
Conclusion
