Text reuse is the process of creating new texts from pre-existing ones. In recent years, Urdu Text Reuse Detection (U-TRD) has attracted the attention of researchers because digital text is readily available on the internet and can be copied or paraphrased without proper attribution, which makes reuse easy but detection challenging. Previous studies have explored U-TRD at the phrasal, sentence/passage, and document levels, using benchmark corpora and methods. However, U-TRD has not been investigated at the lexical level, in terms of either corpora or methods. To address this research gap, our study has developed a large benchmark corpus, UTRD-Lex-23, manually annotated at the lexical level. The corpus consists of 22,184 text pairs categorized into two levels of rewrite: (1) Derived (8,660) and (2) Non-Derived (13,524). We also developed, applied, evaluated, and compared a range of methods on this corpus: baseline methods (uni-gram overlap and word embedding-based methods), state-of-the-art transformer-based methods, and feature-fusion-based methods. Our study concludes that one of our proposed feature-fusion methods outperforms all other methods. This model, which combines seven different Sentence Transformers (ST) (each producing 768-dimensional vectors) with one word-level uni-gram feature and sixteen different features extracted from four different Word Embedding (WE) based models (yielding 300-dimensional vectors), achieves an F1 score of 0.70601 under 10-fold cross-validation. To foster and promote research in Urdu (a low-resourced language), the proposed corpus will be made freely and publicly available for research purposes.