A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu

Muhammad Haseeb, Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Uzma Farooq, Adnan Abid

doi:10.1016/j.dib.2023.109857

Abstract

Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. It is commonly divided into two types. (i) Intrinsic plagiarism detection assesses the consistency of authorship within a single document, aiming to identify passages that may have been copied or paraphrased from elsewhere. Author clustering, which is closely related, groups documents by their stylistic and linguistic characteristics to identify common authors or sources within a dataset. (ii) Extrinsic plagiarism detection compares a suspicious document against a set of external source documents, looking for shared phrases, sentences, or paragraphs; this is often referred to as text reuse or verbatim copying. Plagiarism detection is a long-established task in natural language processing (NLP), with notable contributions across multiple applications. Much research has been conducted for English and other languages, but Urdu has received far less attention, especially in intrinsic plagiarism detection. The main reason is that Urdu is a low-resource language, and no high-quality benchmark corpus is available for intrinsic plagiarism detection in Urdu. This study presents a high-quality benchmark corpus comprising 10,872 documents, structured at two levels of granularity: sentence level and paragraph level. The dataset serves several purposes, supporting intrinsic plagiarism detection, verbatim text reuse identification, and author clustering in Urdu. It is also relevant to NLP researchers and practitioners because it facilitates the development of plagiarism detection models tailored to Urdu. Such models can play a vital role in education and publishing by improving the accuracy of plagiarism detection and helping to identify copied content in Urdu writing.
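To make the notion of verbatim text reuse concrete, the following is a minimal sketch of one common way to measure it: the fraction of word n-grams in a suspicious text that also appear in a source text (often called containment). This is an illustration only, not the method used to build or evaluate the corpus. The function names `word_ngrams` and `containment`, the default `n=3`, and the whitespace tokenization are all assumptions made for this example; real Urdu text would likely need a proper tokenizer and normalization.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a text.

    Assumption: tokens are separated by whitespace, which is only a
    rough approximation for Urdu.
    """
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source.

    Returns 0.0 when the suspicious text is too short to form any n-gram.
    """
    suspicious_ngrams = word_ngrams(suspicious, n)
    if not suspicious_ngrams:
        return 0.0
    shared = suspicious_ngrams & word_ngrams(source, n)
    return len(shared) / len(suspicious_ngrams)
```

A score near 1.0 suggests that most of the suspicious text was copied verbatim from the source, while a score near 0.0 suggests little direct overlap. Paraphrased reuse would generally not be caught by this measure.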

Journal: Data in Brief
Publication Date: Nov 26, 2023
License type: CC BY

