Abstract

Plagiarism is becoming an increasingly serious problem in the academic environment. In this paper, we deal with a specific kind of plagiarism: source code plagiarism. For this kind of plagiarism, no software is available that can detect it on a larger scale (hundreds of student submissions every year). We propose algorithms for source code parsing and processing as part of a complex system for plagiarism detection. Source code vectorization using characteristic vectors is a vital piece of the whole process, and the k‐means algorithm helps with the classification and clustering of the vectors. Student assignments are submitted regularly, and any plagiarism detection system needs to handle them as they arrive. For this reason, we propose a modified incremental k‐means algorithm and a method for determining the number of clusters. We also consider methods for searching for vectors among clusters and suggest using conditional entropy to select the important vector elements used in the search algorithm. Our results show how the proposed algorithms and methods perform on real student submissions.
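To illustrate the idea of clustering submissions as they arrive, the following is a minimal sketch of a generic incremental (online) k‐means update. It is not the modified algorithm proposed in the paper, whose details the abstract does not give. The class name `IncrementalKMeans`, the parameter `new_cluster_threshold`, and the rule of opening a new cluster when a vector is too far from every centroid are all assumptions made for illustration. The input vectors stand in for characteristic vectors of source code.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IncrementalKMeans:
    """Generic online k-means sketch (hypothetical, not the paper's method).

    Each vector is assigned as it arrives. A new cluster is opened when the
    vector is farther than `new_cluster_threshold` from every centroid; this
    threshold rule is an assumption used here in place of the paper's method
    for determining the number of clusters.
    """

    def __init__(self, new_cluster_threshold):
        self.threshold = new_cluster_threshold
        self.centroids = []  # one centroid (list of floats) per cluster
        self.counts = []     # number of vectors assigned to each cluster

    def nearest(self, vector):
        """Return (index, distance) of the closest centroid, or (None, inf)."""
        best_index, best_dist = None, math.inf
        for i, centroid in enumerate(self.centroids):
            dist = euclidean_distance(vector, centroid)
            if dist < best_dist:
                best_index, best_dist = i, dist
        return best_index, best_dist

    def add(self, vector):
        """Assign one incoming vector and return its cluster index."""
        index, dist = self.nearest(vector)
        if index is None or dist > self.threshold:
            # No cluster is close enough: start a new one at this vector.
            self.centroids.append([float(x) for x in vector])
            self.counts.append(1)
            return len(self.centroids) - 1
        # Move the centroid toward the vector using a running-mean update,
        # so the centroid stays the mean of all vectors assigned so far.
        self.counts[index] += 1
        n = self.counts[index]
        centroid = self.centroids[index]
        for j, x in enumerate(vector):
            centroid[j] += (x - centroid[j]) / n
        return index


model = IncrementalKMeans(new_cluster_threshold=2.0)
labels = [model.add(v) for v in ([0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0])]
print(labels, model.centroids)
```

The running-mean update avoids storing or revisiting earlier submissions, which is the property that makes an incremental scheme attractive when assignments arrive continuously. Its result depends on arrival order, unlike batch k‐means.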
