Abstract

Source code plagiarism is an emerging issue in computer science education, and a number of techniques have been proposed to address it. Comparing these techniques can be challenging, however, since each is typically evaluated on its own private dataset(s). This paper contributes a public dataset for comparing such techniques. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. The dataset consists of 467 source code files, covering seven introductory programming assessment tasks. Uniquely, both the intention to plagiarise and advanced plagiarism attacks were considered in its construction. The dataset's characteristics were observed by comparing three IR-based detection techniques; most IR-based techniques proved less effective than a baseline technique relying on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
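As a rough illustration of the Greedy-String-Tiling baseline mentioned above, the sketch below implements plain Greedy String Tiling over token sequences. This is an assumption-laden simplification: the Running-Karp-Rabin variant used by the baseline additionally applies Karp-Rabin hashing to speed up the match search, which this version omits, and the tokenisation step is left to the caller.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Plain Greedy String Tiling (no Karp-Rabin speed-up).

    a, b: token sequences (e.g. lexed source code).
    Returns a list of tiles (i, j, length): maximal non-overlapping
    common substrings of at least min_match tokens.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match - 1
        matches = []
        # Scan every unmarked starting pair for the longest common run.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    max_len = k
                    matches = [(i, j, k)]
                elif k == max_len and k >= min_match:
                    matches.append((i, j, k))
        if max_len < min_match:
            break
        # Mark the longest matches of this round as tiles.
        for i, j, k in matches:
            if any(marked_a[i + t] or marked_b[j + t] for t in range(k)):
                continue  # overlaps a tile already taken this round
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def gst_similarity(a, b, min_match=3):
    # Classic tile-coverage similarity: fraction of tokens in tiles.
    coverage = sum(length for _, _, length in
                   greedy_string_tiling(a, b, min_match))
    return 2 * coverage / (len(a) + len(b)) if (a or b) else 0.0
```

Because tiles must not overlap, simple reordering or renaming attacks on the token stream lower the coverage score gradually rather than defeating the comparison outright, which is why GST-style baselines remain strong in this setting.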

Highlights

  • Source code plagiarism is an act of reusing other people’s code with no acknowledgement of the original work (Cosma & Joy, 2008)

  • This paper presents a dataset for source code plagiarism detection

  • This paper proposes a dataset for evaluating source code plagiarism detection techniques from an Information Retrieval (IR) perspective, where the plagiarised cases are formed with the intention of plagiarising and contain advanced plagiarism attacks in addition to simple ones


Introduction

Source code plagiarism is an act of reusing other people’s code with no (or improper) acknowledgement of the original work (Cosma & Joy, 2008). It is an emerging issue in Computer Science (CS) education (Simon et al., 2018), as grades may fail to reflect students’ real capabilities. Many detection techniques have been proposed in response. Some focus more on effectiveness factors (such as accuracy and the capability to detect complex modifications), while others focus on efficiency (such as processing time). These techniques can be challenging to compare with one another, since most are evaluated on their own datasets, and those datasets are not publicly accessible. Such datasets, stored on the corresponding author’s local repository, may go missing or become corrupted due to technical problems. Another important issue with existing datasets is that some may not represent real plagiarism cases. In this work, the dataset’s characteristics were observed by comparing detection techniques derived from three popular IR retrieval models: the Vector Space Model, Latent Semantic Indexing, and the Language Model (Croft, Metzler, & Strohman, 2010).
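To make the Vector Space Model comparison concrete, the sketch below scores two token sequences by cosine similarity over raw term-frequency vectors. This is a minimal illustration only: the compared techniques in the paper may use different weighting schemes (e.g. tf-idf) and tokenisation, which are assumptions left out here.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of term-frequency vectors over code tokens."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in set(va) | set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that a bag-of-tokens model like this is order-insensitive, which hints at why IR-based techniques can be much faster than tiling-based baselines yet less effective against structural plagiarism attacks.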

A Review of Automated Source Code Plagiarism Detection
The Dataset
Methodology
Result
Methodology
Baseline Analysis
LM-Oriented Analysis
Summary
Findings
Conclusion
