Compiling a text re-use detection corpus from scientific papers with semi-real cases of plagiarism

Salar Mohtaj, Vahid Zarrabi, Habibollah Asghari

doi:10.1109/ialp.2017.8300585

Abstract

Automatic plagiarism detection deals with retrieving reused text fragments in a document and finding their source documents. Because many plagiarism detection methods have been developed, large-scale plagiarism corpora are needed to evaluate them. Despite their importance, few plagiarism detection corpora have been developed in recent years, especially for low-resource languages. Because of legal issues, releasing a collection of real cases of plagiarism for evaluation purposes is not ethical. Given these limitations, simulated and artificial methods are the two main approaches to compiling a plagiarism corpus. These approaches try to imitate real cases of plagiarism from different points of view. However, there are still fundamental differences between simulated corpora and real cases of plagiarism. In this paper, a semi-real approach is proposed to create a collection of plagiarism cases as a corpus. The approach is based on removing correct references from scientific papers so that the passages they accompany become plagiarized passages. Unlike simulated and artificial approaches, the proposed corpus can correctly simulate real cases of text re-use. The evaluation results show high accuracy of the proposed corpus with respect to n-gram similarity for different ranges of N.
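The abstract does not say how n-gram similarity is computed, so the following is only a minimal sketch of one common choice: word n-gram containment, the share of a suspicious passage's n-grams that also occur in a candidate source. The function names, the whitespace tokenization, and the containment measure itself are assumptions made for illustration, not the authors' evaluation procedure.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(suspicious, source, n):
    """Fraction of the suspicious passage's n-grams that also appear in the source.

    Hypothetical helper: containment is one of several possible n-gram
    similarity measures and is not confirmed by the paper.
    """
    suspicious_grams = word_ngrams(suspicious, n)
    if not suspicious_grams:
        return 0.0  # passage shorter than n tokens: nothing to compare
    shared = suspicious_grams & word_ngrams(source, n)
    return len(shared) / len(suspicious_grams)
```

Calling `ngram_containment(passage, source, n)` for several values of `n` gives a similarity profile: scores that stay high as `n` grows suggest near-verbatim re-use, while scores that fall quickly suggest heavier rewording.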

Similar Papers

On the mono- and cross-language detection of text reuse and plagiarism
Alberto Barrón-Cedeño
19 Jul 2010

Mono- and cross-lingual paraphrased text reuse and extrinsic plagiarism detection
24 Jun 2020

A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu
Muhammad Haseeb ... Adnan Abid
Data in Brief | VOL. 52
26 Nov 2023

The Study of Plagiarism Detection for Program Code
Hao Jiang ... Zhemin Jiang
01 Jan 2010
