Abstract

Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are making the reuse of text not only more common in society but also harder to detect. A major hindrance to the development and evaluation of existing and new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Among other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for Urdu, a widely spoken but under-resourced language. The COrpus of Urdu News TExt Reuse (COUNTER) contains 1200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. We also apply a number of similarity estimation methods to our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general, and for the Urdu language in particular.

Highlights

  • Text reuse occurs when pre-existing text(s) (source(s)) are reused to create a new text

  • The lack of large-scale standardized evaluation resources with real examples of text reuse is a major problem in the analysis and development of text reuse detection systems

  • This paper presents our novel contribution: the development of the first mono-lingual text reuse corpus for the Urdu language


Summary

Introduction

Text reuse occurs when pre-existing text(s) (source(s)) are reused to create a new text (derived). It is the process of reusing someone else’s work by changing its form. Reuse is not limited to text: ideas, software source code, images and music are also often reused, but our focus is on text reuse only. As the amount of text that is reused varies, text reuse is commonly classified as either local or global. When sentences or paragraphs are borrowed from the source, it is considered local text reuse, whereas when the text of the entire source document(s) is used to create a new document, it is called global text reuse (Seo and Croft 2008; Mittelbach et al. 2010).
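The difference between local and global reuse can be made concrete with a simple overlap measure. The sketch below is illustrative only and is not taken from the paper: it assumes whitespace tokenisation (a crude choice for Urdu) and uses word n-gram containment, one common similarity estimate, with the hypothetical helper names `word_ngrams` and `containment`. A derived text that reuses most of its source would score close to 1, while one that borrows only a few sentences would score lower.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a whitespace-tokenised text.

    Assumption: splitting on whitespace is good enough for illustration;
    a real system for Urdu would need a proper tokeniser.
    """
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(derived, source, n=3):
    """Fraction of the derived text's n-grams that also occur in the source.

    Returns 0.0 when the derived text is too short to contain any n-gram.
    """
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        return 0.0
    shared = derived_grams & word_ngrams(source, n)
    return len(shared) / len(derived_grams)


if __name__ == "__main__":
    source = "the minister announced a new policy on education today"
    # Reuses the whole source verbatim (global-style reuse).
    whole = "the minister announced a new policy on education today"
    # Borrows only one phrase from the source (local-style reuse).
    partial = "critics said a new policy on education was overdue"

    print(containment(whole, source))
    print(containment(partial, source))
```

How a score is mapped to labels such as wholly, partially or non derived is a separate design decision; any threshold would have to be tuned on annotated data rather than assumed.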
