Measuring Short Text Reuse for the Urdu Language

Rao Muhammad Adeel Nawab,Paul Rayson,Sara Sameen,Muhammad Sharjeel,Iqra Muneer

doi:10.1109/access.2017.2776842

Open Access

Abstract

Text reuse occurs when text is borrowed, either verbatim or paraphrased, from an earlier written text. A large and increasing amount of digital text is readily available, which makes text easier to reuse but reuse harder to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this paper, we propose one such resource for a significantly under-resourced language, Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu short text reuse corpus contains 2684 short Urdu text pairs, manually labeled as verbatim (496), paraphrased (1329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods in binary and multi-class classification settings with a set of four classifiers. The results show that character n-gram overlap with the J48 classifier outperforms the other methods on the Urdu short text reuse detection task.
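The label counts reported in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the mapping from three labels to two classes (verbatim and paraphrased merged into "reused") is an assumption about how a binary setting might be formed, not something the abstract specifies.

```python
# Label counts as reported in the abstract.
counts = {"verbatim": 496, "paraphrased": 1329, "independently written": 859}

total = sum(counts.values())  # 2684 text pairs in total

# Percentage share of each label in the three-class setting.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# ASSUMPTION: a binary setting that merges verbatim and paraphrased into "reused".
binary = {
    "reused": counts["verbatim"] + counts["paraphrased"],
    "not reused": counts["independently written"],
}

print(total, shares, binary)
```

Under this assumed mapping the binary setting is more imbalanced than the three-class one, which is worth keeping in mind when comparing scores across the two settings.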

Highlights

Text reuse is a process in which pre-existing text(s) are reused to generate new text(s) [1], [2]
The corpus is saved in XML format and is freely and publicly available to download for research purposes under a Creative Commons CC-BY-NC-SA licence
We demonstrate how our proposed corpus can be used for the development and evaluation of an Urdu short text reuse detection task, by applying various state-of-the-art text reuse detection methods grouped into four categories, (1) Lexical Methods (Word n-gram Overlap and Vector Space Model), (2) String and Sequence Alignment Methods (Longest Common Subsequence, Greedy String Tiling, Global Alignment, and Local Alignment), (3) Structural Methods (Character n-gram Overlap), and (4) Stylistic Methods (Token Ratio and Type Token Ratio)

Summary

INTRODUCTION

Text reuse is a process in which pre-existing text(s) are reused (verbatim or rewritten) to generate new text(s) [1], [2]. To compare, analyse, and evaluate text reuse detection methods, benchmark corpora are needed. We demonstrate how our proposed corpus can be used for the development and evaluation of an Urdu short text reuse detection task by applying various state-of-the-art text reuse detection methods grouped into four categories: (1) Lexical Methods (Word n-gram Overlap and Vector Space Model), (2) String and Sequence Alignment Methods (Longest Common Subsequence, Greedy String Tiling, Global Alignment, and Local Alignment), (3) Structural Methods (Character n-gram Overlap), and (4) Stylistic Methods (Token Ratio and Type Token Ratio). To the best of our knowledge, the proposed corpus is the first of its kind. It is intended to serve as a benchmark for the future development and evaluation of Urdu short text reuse systems and to promote research in a resource-poor language.
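As an illustration of the n-gram overlap methods named above, the following minimal sketch computes a containment score over word n-grams and over character n-grams. The function names, the containment formula, the choice of n, and the English toy sentences are all assumptions made for this example; the paper's exact similarity measure and preprocessing are not given in this summary. The same code works unchanged on Urdu strings, since Python strings are Unicode.

```python
def ngrams(items, n):
    """Return the set of contiguous n-grams over a sequence (word list or string)."""
    return {tuple(items[i:i + n]) for i in range(len(items) - n + 1)}


def containment(source, derived, n):
    """Share of the derived text's n-grams that also occur in the source (0.0 to 1.0)."""
    src, der = ngrams(source, n), ngrams(derived, n)
    if not der:
        return 0.0  # derived text is shorter than n, so there is nothing to compare
    return len(src & der) / len(der)


# Toy pair (hypothetical, not taken from the corpus).
source = "the quick brown fox jumps over the lazy dog"
derived = "a quick brown fox leaps over the lazy dog"

# Word bigram overlap: whitespace tokenisation is a simplifying assumption.
word_score = containment(source.split(), derived.split(), 2)

# Character trigram overlap: operates directly on the raw string.
char_score = containment(source, derived, 3)

print(word_score, char_score)
```

In a setup like the one described, scores of this kind would be computed per text pair and passed as features to a classifier.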

RELATED WORK

TEXT REUSE DETECTION METHODS

LEXICAL METHODS

STRING AND SEQUENCE ALIGNMENT METHODS
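Of the alignment methods listed in the introduction, Longest Common Subsequence (LCS) is the simplest to sketch. The version below uses the standard dynamic-programming recurrence over word tokens; normalising by the length of the shorter text is an assumption for this example, as are the function names and the toy sentences.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token in a or b
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the shorter text's length (assumed normalisation)."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


# Toy pair (hypothetical, not taken from the corpus).
a = "the quick brown fox jumps over the lazy dog".split()
b = "a quick brown fox leaps over the lazy dog".split()

print(lcs_length(a, b), lcs_similarity(a, b))
```

Unlike n-gram overlap, LCS tolerates insertions and substitutions between matching tokens, which is why it can still score paraphrased pairs reasonably when word order is preserved.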

STRUCTURAL SIMILARITY METHOD

EXPERIMENTAL SET UP

EVALUATION METHODOLOGY

RESULTS AND ANALYSIS

CONCLUSION AND FUTURE WORK