An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection

Andrianna Polydouri, Andreas Stafylopatis, Georgios Siolas, Eleni Vathi

doi:10.1007/s12530-018-9232-1

Abstract

The ever-increasing volume of information brought about by the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism occurs in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that seem to differ significantly from it. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, data imbalance as a crucial aspect of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced-dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and achieves the best resulting scores.
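To make the imbalance issue concrete: in intrinsic plagiarism detection, plagiarized passages are typically far rarer than original ones, so a classifier trained on the raw data can score well by ignoring the minority class. The abstract does not name the balancing techniques that were tried, so the following is only a minimal sketch of one common option, random oversampling of the minority class. The function name `random_oversample`, the single-value feature vectors, and the toy labels are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest one.

    Illustrative only: this is a generic balancing technique, not
    necessarily one of those evaluated in the paper.
    """
    rng = random.Random(seed)

    # Group feature vectors by class label.
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest class.
    target = max(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra samples with replacement from the existing ones.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data (invented): label 1 = plagiarized passage (rare),
# label 0 = original passage (common). Each passage has one made-up
# stylistic feature value.
X = [[0.10], [0.20], [0.15], [0.12], [0.18], [0.90]]
y = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

After the call, both classes contain the same number of passages, and the balanced set can be passed to any classifier. Oversampling should be applied to the training split only, so that duplicated passages do not leak into the evaluation data.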

Full Text