Abstract

Cross-lingual plagiarism occurs when the source (original) text is in one language and the plagiarized text is in another. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community for two reasons: a large amount of digital text is easily accessible in many languages through online digital repositories, and machine translation systems are readily available. Together, these make cross-lingual plagiarism easier to commit and harder to detect. To develop and evaluate cross-lingual plagiarism detection systems, standard evaluation resources are needed. Most earlier studies have developed cross-lingual plagiarism corpora for English paired with other European languages. However, for the Urdu-English language pair, the problem of cross-lingual plagiarism detection has not been thoroughly explored, although a large amount of digital text is readily available in Urdu and the language is spoken in many countries (particularly Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 automatic translation, 539 artificially paraphrased, 508 manually paraphrased, and 808 nonplagiarized). Furthermore, the corpus contains three types of cross-lingual examples: artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized). This combination has not been previously reported in the development of cross-lingual corpora. A detailed analysis of the corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatic translation, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively. These results show that the documents in the proposed corpus were created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe that the corpus developed in this study will help foster research on Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. The proposed corpus is free and publicly available for research purposes.
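
To make the two similarity measures named in the abstract concrete, the following is a minimal Python sketch of word n-gram overlap and a normalized longest common subsequence (LCS) score. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which overlap variant or which normalization was used, so the containment measure (shared n-grams divided by the suspicious text's n-grams) and normalization by the shorter text are assumptions here. The function names and the English example sentences are hypothetical, and the sketch assumes both texts are already in the same language, for example after the Urdu source has been translated into English.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct word n-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious: str, source: str, n: int = 1) -> float:
    """Fraction of the suspicious text's n-grams that also occur in the source.

    Assumption: containment is used as the overlap measure; the abstract only
    says "n-gram overlap".
    """
    susp = word_ngrams(suspicious, n)
    src = word_ngrams(source, n)
    if not susp:
        return 0.0
    return len(susp & src) / len(susp)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(suspicious: str, source: str) -> float:
    """LCS length divided by the token count of the shorter text (assumed normalization)."""
    a = suspicious.lower().split()
    b = source.lower().split()
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


# Hypothetical example texts, both in English.
source = "the quick brown fox jumps over the lazy dog"
paraphrase = "a quick brown fox leaps over a lazy dog"

print(containment(source, source, n=1))      # 1.0 for an exact copy
print(containment(paraphrase, source, n=1))  # 0.75 (6 of 8 distinct unigrams shared)
print(normalized_lcs(paraphrase, source))    # 6 / 9, about 0.667
```

Under these assumptions, a verbatim copy scores 1.0 and scores fall as more words are replaced or reordered, which is the same ordering the abstract reports across its four document categories.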

Highlights

  • In cross-lingual plagiarism, a piece of text in one language is translated into another language without changing its semantics or content and without citing the original source [1, 2].

  • Wikipedia contains articles on the same topics in more than 200 languages. Thirdly, people may often be interested in writing in a language other than their native language.

  • Among the text rewriting tools, we found that two have the highest number of visitors per day: (1) Spinbot, with an average of 26k visitors per day, and (2) Article Rewriter, with an average of 45k visitors per day. These figures are reported by Alexa (a ranking system run by alexa.com that audits and publishes how often various websites are visited) and are higher than those of other tools such as http://paraphrasing-tool.com/.


Summary

Introduction

In cross-lingual plagiarism, a piece of text in one (source) language is translated into another (target) language without changing its semantics or content and without citing the original source [1, 2]. Cross-lingual plagiarism detection is a challenging research problem for several reasons. Machine translation systems such as Google Translate (https://translate.google.com/) are available online free of cost and can translate a document written in one language into another. People may also often be interested in writing in a language other than their native language. In addition, Wikipedia contains articles on the same topics in more than 200 languages. All these factors contribute to an environment in which cross-lingual plagiarism is easier to commit and more difficult to detect.

