Abstract

Automatic plagiarism detection deals with retrieving reused text fragments in a document and finding their source documents. As a variety of plagiarism detection methods have been developed, large-scale plagiarism corpora are needed to evaluate them. Despite their importance, few plagiarism detection corpora have been developed in recent years, especially for low-resource languages. Because of legal issues, releasing a collection of real cases of plagiarism for evaluation purposes is not ethical. Given these limitations, simulated and artificial methods are the two main approaches to compiling a plagiarism corpus. These approaches try to imitate real cases of plagiarism from different points of view. However, there are still fundamental differences between simulated corpora and real cases of plagiarism. In this paper, a semi-real approach is proposed for creating a collection of plagiarism cases as a corpus. The approach is based on removing correct references from scientific papers, turning the previously cited passages into plagiarized ones. Unlike corpora built with simulated and artificial approaches, the proposed corpus can correctly simulate real cases of text reuse. The evaluation results show high accuracy of the proposed corpus with respect to n-gram similarity for different ranges of n.
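
To make the n-gram similarity mentioned in the evaluation concrete, the following is a minimal sketch of one common way to measure it: word n-gram containment between a suspicious passage and a candidate source. The abstract does not state which similarity measure, tokenization, or values of n were used, so the containment formula, the whitespace tokenizer, and the function names `word_ngrams` and `ngram_containment` are all assumptions made for illustration.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized).

    Assumption: simple whitespace tokenization; the paper may use a different one.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(suspicious, source, n):
    """Fraction of the suspicious passage's n-grams that also occur in the source.

    Assumption: containment is used here for illustration only; it is not
    necessarily the measure used in the paper.
    """
    suspicious_ngrams = word_ngrams(suspicious, n)
    if not suspicious_ngrams:
        return 0.0  # passage is shorter than n words
    shared = suspicious_ngrams & word_ngrams(source, n)
    return len(shared) / len(suspicious_ngrams)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    suspicious = "the quick brown fox sleeps"
    # Report the similarity for a range of n, as the evaluation does in spirit.
    for n in range(1, 4):
        print(n, round(ngram_containment(suspicious, source, n), 3))
```

Containment is asymmetric, which suits the case where a short reused passage is compared against a much longer source document; a symmetric measure such as the Jaccard coefficient would be an equally plausible choice.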
