Pertinence of Lexical and Structural Features for Plagiarism Detection in Source Code

Gabriela Ramírez-De-La-Rosa, Christian Sánchez-Sánchez, Esaú Villatoro-Tello, Aarón Ramírez-De-La-Cruz, Héctor Jiménez-Salazar

doi:10.13053/rcs-85-1-2


Open Access



Abstract

Source code plagiarism can be identified by analyzing several diverse views of a pair of source codes. In this paper we present three representations drawn from lexical and structural views of a given source code. We attempt to show that different representations provide diverse information that can be useful for identifying plagiarism. In particular, we present representations based on character 3-grams, the data types in function signatures, and the identifier names in function signatures. Although we used only three representations, more can be added. We conducted our analysis over a collection of 79 source codes written in C. Our results show that the n-gram representation is very informative, but also that the representations taken from function signatures are, to some extent, complementary.
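To make the three representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes Jaccard overlap as the comparison measure and a simple regular expression for finding C function definitions; the abstract specifies neither. The names char_trigrams, signature_features, and jaccard are hypothetical.

```python
import re

# Assumption: a crude pattern for C function definitions of the form
# "<type> <name>(<params>) {". Multi-word return types such as
# "unsigned int" are only partially captured; a real system would use a parser.
SIGNATURE = re.compile(r"([A-Za-z_]\w*)([\s*]+)([A-Za-z_]\w*)\s*\(([^()]*)\)\s*\{")
CONTROL = {"if", "for", "while", "switch"}  # e.g. "else if (...) {" is not a function


def char_trigrams(source):
    """Lexical view: the set of character 3-grams after collapsing whitespace."""
    text = " ".join(source.split())
    return {text[i:i + 3] for i in range(len(text) - 2)}


def split_type_and_name(declaration):
    """Split e.g. 'char *s' into ('char *', 's'); return None for 'void' or ''."""
    tokens = declaration.replace("*", " * ").split()
    if len(tokens) < 2:
        return None
    return " ".join(tokens[:-1]), tokens[-1]


def signature_features(source):
    """Structural views: data types and identifier names in function signatures."""
    types, names = [], []
    for match in SIGNATURE.finditer(source):
        ret, stars, func, params = match.groups()
        if func in CONTROL:
            continue
        declarations = [ret + stars + func] + params.split(",")
        for declaration in declarations:
            parsed = split_type_and_name(declaration)
            if parsed is not None:
                types.append(parsed[0])
                names.append(parsed[1])
    return types, names


def jaccard(a, b):
    """Assumed similarity measure: |A ∩ B| / |A ∪ B| over two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical submissions that differ only in identifier names.
code_a = "int add(int a, int b) { return a + b; }"
code_b = "int sum(int x, int y) { return x + y; }"

types_a, names_a = signature_features(code_a)
types_b, names_b = signature_features(code_b)

trigram_similarity = jaccard(char_trigrams(code_a), char_trigrams(code_b))
type_similarity = jaccard(types_a, types_b)
name_similarity = jaccard(names_a, names_b)
```

In this toy pair the signature data types agree completely while the signature identifier names do not overlap at all, and the 3-gram similarity falls somewhere in between. This is one way the three views can carry different information about the same pair.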

Full Text