Intrinsic Plagiarism Detection with Feature-Rich Imbalanced Dataset Learning

Andrianna Polydouri, Georgios Siolas, Andreas Stafylopatis

doi:10.1007/978-3-319-65172-9_9

Abstract

Intrinsic plagiarism detection aims to discover plagiarised passages in a text based solely on stylistic changes and inconsistencies within the document itself. The main idea is to profile the style of the original author and to mark as outliers the passages that differ significantly from that profile. Besides some novel stylistic and semantic features, the present work proposes a new approach to the problem in which machine learning plays a significant role. Notably, we also consider, for the first time, the imbalance of the training dataset as a major parameter of the intrinsic plagiarism detection problem. Our detection system is tested on the corpora of the PAN Webis intrinsic plagiarism detection shared tasks of 2009 and 2011, and its results are compared with those of the highest-scoring participants.

Full Text