Abstract

Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea; otherwise non-paraphrased. Previously, the researchers have mainly focused on developing resources for the English language for paraphrase detection. There have been very few efforts for paraphrase detection in South Asian languages. However, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language. It is mainly due to the unavailability of the corpora that focus on the sentence level. The available related studies on the Urdu language only focus on text reuse detection tasks at the passage and document levels. Therefore, this study aims to develop a large-scale manually annotated benchmark Urdu paraphrase detection corpus at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from the Urdu newspapers. Moreover, several techniques were proposed, developed, and compared as a secondary contribution, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique for our research. The experimental results indicate that our proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, the performance increases when features of the proposed (ST) and baseline (N-gram) are combined for the classification task. In addition, The proposed techniques have also been applied to the UPPC corpus to check their performance at the document level. The best result we obtained using the feature fusion technique ( F 1 = 0.855). Our corpus is available and free to download for research purposes.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call