Authorship Attribution for a Resource Poor Language—Urdu

Abstract

Authorship attribution refers to examining the writing style of authors to identify, from a given set of potential authors, the most likely author of a document. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of fewer than 20 candidate authors, which is far from real-world settings; their findings may therefore not carry over to such settings. To that end, we have made three key contributions. First, we have developed a large authorship attribution corpus for Urdu, a low-resource language. The corpus comprises over 2.6 million tokens and 21,938 news articles by 94 authors, making it a closer approximation of real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features applicable to the Urdu language and have developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our corpus is many times larger than the existing corpora and more challenging than its counterparts for the authorship attribution task, and (b) Convolutional Neural Networks are the most effective technique, achieving a nearly perfect F1 score of 0.989 on an existing corpus and 0.910 on our newly developed corpus.
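As an illustration of the attribution task the abstract describes, a minimal nearest-profile baseline over character n-grams can be sketched as below. This is not the paper's CNN or feature set; the toy author texts and function names are invented for the example.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    common = set(p) & set(q)
    dot = sum(p[g] * q[g] for g in common)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(unknown, candidates, n=3):
    """Return the candidate author whose training text is most similar."""
    profile = char_ngrams(unknown, n)
    return max(candidates, key=lambda a: cosine(profile, char_ngrams(candidates[a], n)))

authors = {
    "A": "the cat sat on the mat and the cat purred softly",
    "B": "stocks rallied today as markets digested the report",
}
print(attribute("the cat napped on the mat", authors))  # → A
```

Character n-grams need no tokenizer or POS tagger, which is one reason they are popular for low-resource languages such as Urdu.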

Similar Papers
  • Research Article
  • Cited by 4
  • 10.1371/journal.pone.0310057
Automatic authorship attribution in Albanian texts.
  • Oct 22, 2024
  • PloS one
  • Arta Misini + 3 more

Automatic authorship identification is a challenging task that has been the focus of extensive research in natural language processing. Regardless of the progress made in attributing authorship, the lack of corpora in under-resourced languages impedes the advancement and evaluation of existing methods. To address this gap, we investigate the problem of authorship attribution in Albanian. We introduce a newly compiled corpus of Albanian newsroom columns and literary works and analyze machine-learning methods for detecting authorship. We create a set of hand-crafted features targeting various categories (lexical, morphological, and structural) relevant to Albanian and experiment with multiple classifiers using two different multiclass classification strategies. Furthermore, we compare our results to those obtained using deep learning models. Our investigation focuses on identifying the best combination of features and classification methods. The results reveal that lexical features are the most effective set of linguistic features, significantly improving the performance of various algorithms in the authorship attribution task. Among the machine learning algorithms evaluated, XGBoost demonstrated the best overall performance, achieving an F1 score of 0.982 on literary works and 0.905 on newsroom columns. Additionally, deep learning models such as fastText and BERT-multilingual showed promising results, highlighting their potential applicability in specific scenarios in Albanian writings. These findings contribute to the understanding of effective methods for authorship attribution in low-resource languages and provide a robust framework for future research in this area. The careful analysis of the different scenarios and the conclusions drawn from the results provide valuable insights into the potential and limitations of the methods and highlight the challenges in detecting authorship in Albanian.
Promising results are reported, with implications for improving the methods used in Albanian authorship attribution. This study provides a valuable resource for future research and a reference for researchers in this domain.

  • Research Article
  • Cited by 3
  • 10.5281/zenodo.50899
Explaining Delta, or: How do distance measures for authorship attribution work?
  • Jun 5, 2015
  • Computational Linguistics
  • Stefan Evert + 5 more

Authorship Attribution is a research area in quantitative text analysis concerned with attributing texts of unknown or disputed authorship to their actual author based on quantitatively measured linguistic evidence (see Juola 2006; Stamatatos 2009; Koppel et al. 2009). Authorship attribution has applications in literary studies, history, forensics and many other fields, e.g. corpus stylistics (Oakes 2009). The fundamental assumption in authorship attribution is that individuals have idiosyncratic habits of language use, leading to a stylistic similarity of texts written by the same person. Many of these stylistic habits can be measured by assessing the relative frequencies of function words or parts of speech, vocabulary richness, and many other linguistic features. Distance metrics between the resulting feature vectors indicate the overall similarity of texts to each other, and can be used for attributing a text of unknown authorship to the most similar of a (usually closed) set of candidate authors. The aim of this paper is to present findings from a larger investigation of authorship attribution methods which centres around the following questions: (a) How and why exactly does authorship attribution based on distance measures work? (b) Why do different distance measures and normalization strategies perform differently? (c) Specifically, why do they perform differently for different languages and language families, and (d) How can such knowledge be used to improve authorship attribution methods? First, we describe current issues in authorship attribution and contextualize our own work. Second, we report some of our earlier research into the question. Then, we present our most recent investigation, which pertains to the effects of normalization methods and distance measures in different languages, describing our aims, data and methods. We conclude with a summary of our results.
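The classic Burrows's Delta that this paper dissects is simple to state: z-score the relative frequencies of the most frequent words against the corpus, then average the absolute differences between two documents. A minimal sketch follows, assuming whitespace tokenization; the toy corpus is invented for the example.

```python
import statistics
from collections import Counter

def rel_freqs(text):
    """Relative word frequencies, whitespace-tokenized."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def burrows_delta(corpus, doc_a, doc_b, top_n=20):
    """Mean absolute difference of z-scored relative frequencies over
    the top_n most frequent words of the corpus (Burrows's Delta)."""
    profiles = [rel_freqs(t) for t in corpus]
    totals = Counter()
    for p in profiles:
        totals.update(p)
    features = [w for w, _ in totals.most_common(top_n)]
    mu = {w: statistics.mean(p.get(w, 0.0) for p in profiles) for w in features}
    # guard against zero spread so the division below is always defined
    sd = {w: statistics.pstdev([p.get(w, 0.0) for p in profiles]) or 1.0 for w in features}
    fa, fb = rel_freqs(doc_a), rel_freqs(doc_b)
    z = lambda f, w: (f.get(w, 0.0) - mu[w]) / sd[w]
    return statistics.mean(abs(z(fa, w) - z(fb, w)) for w in features)

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks and the fox runs away",
    "a lazy afternoon with the quick brown dog",
]
print(burrows_delta(corpus, corpus[0], corpus[0]))  # identical texts → 0.0
```

The normalization choice (here, z-scores over corpus-wide word frequencies) is exactly the knob whose effect across languages the paper investigates.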

  • Conference Article
  • Cited by 4
  • 10.1109/ijcnn52387.2021.9533619
DAAB: Deep Authorship Attribution in Bengali
  • Jul 18, 2021
  • Atish Kumar Dipongkor + 6 more

Authorship attribution identifies the true author of an unknown document. Authorship attribution plays a crucial role in plagiarism detection and blackmailer identification; however, the existing studies on authorship attribution in Bengali are limited. In this paper, we propose an instance-based deep authorship attribution model, called DAAB, to identify authors in Bengali. Our DAAB model fuses features from convolutional neural networks with another set of features from an artificial neural network to learn the stylometry of an author for authorship attribution. Extensive experiments with three real benchmark datasets, namely Bengali-Quora and two online Bengali corpora, demonstrate the superiority of our authorship attribution model.

  • Research Article
  • Cited by 155
  • 10.1093/llc/fqq013
The effect of author set size and data size in authorship attribution
  • Aug 16, 2010
  • Literary and Linguistic Computing
  • K Luyckx + 1 more

Applications of authorship attribution 'in the wild' (Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advance Access published January 12, 2010: 10.1007/s10579-009-9111-2), for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in supervised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per candidate author (i.e. the size of the training data). We also investigate the robustness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning for discriminating between the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and the size of training data decreases, although the machine learning approach continues performing significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.

  • Research Article
  • Cited by 16
  • 10.1109/access.2018.2869198
An Effective and Scalable Framework for Authorship Attribution Query Processing
  • Jan 1, 2018
  • IEEE Access
  • Raheem Sarwar + 8 more

Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have a small number of text samples, e.g., 5–10 texts/author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long text focus on application scenarios where the number of candidate authors is limited to 50. These solutions generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document, which transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against the state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3000 novels by 500 authors. Experimental results from this paper show that our method significantly outperforms all competitors. Specifically, for both the closed-set and open-set authorship attribution problems, our method has achieved higher than 95% accuracy.

  • Research Article
  • Cited by 10
  • 10.32604/cmc.2022.025543
Deep Learning and Machine Learning-Based Model for Conversational Sentiment Classification
  • Jan 1, 2022
  • Computers, Materials & Continua
  • Sami Ullah + 4 more

In the current era of the internet, people use online media for conversation, discussion, chatting, and other similar purposes. Analysis of such material, where more than one person is involved, poses a distinct challenge compared to other text analysis tasks. There are several approaches to identifying users' emotions from conversational text in the English language; however, regional or low-resource languages have been neglected. The Urdu language is one of them and, despite being used by millions of users across the globe, to the best of our knowledge there exists no work on dialogue analysis in the Urdu language. Therefore, in this paper, we have proposed a model which utilizes deep learning and machine learning approaches for the classification of users' emotions from text. To accomplish this task, we first created a dataset for the Urdu language with the help of existing English language datasets for dialogue analysis. After that, we preprocessed the data and selected dialogues with common emotions. Once the dataset was prepared, we used different deep learning and machine learning techniques for the classification of emotion, tuning the algorithms to the Urdu language datasets. The experimental evaluation has shown encouraging results, with 67% accuracy on the Urdu dialogue dataset; more than 10,000 dialogues are classified into five emotions, i.e., joy, fear, anger, sadness, and neutral. We believe that this is the first effort for emotion detection from conversational text in the Urdu language domain.

  • Research Article
  • Cited by 14
  • 10.1016/j.procs.2015.04.110
Influence of Lexical, Syntactic and Structural Features and their Combination on Authorship Attribution for Telugu Text
  • Jan 1, 2015
  • Procedia Computer Science
  • S Naga Prasad + 3 more

  • Research Article
  • Cited by 319
  • 10.1007/s10579-009-9111-2
Authorship attribution in the wild
  • Jan 13, 2010
  • Language Resources and Evaluation
  • Moshe Koppel + 2 more

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
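The randomized-feature-set idea in this abstract can be sketched as follows: repeatedly find the nearest candidate under a random subset of features, and attribute only if one candidate wins a large fraction of the rounds, abstaining otherwise (since the true author may be absent). All parameter names and toy texts below are invented for illustration; this is not the paper's exact procedure.

```python
import math
import random
from collections import Counter

def vec(text, feats):
    """Count vector of `text` restricted to the given feature words."""
    c = Counter(text.lower().split())
    return [c[w] for w in feats]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute_or_abstain(anon, candidates, iters=200, frac=0.5, threshold=0.8, seed=0):
    """Attribute `anon` only if one candidate is nearest under at least
    `threshold` of the random feature subsets; otherwise return None."""
    rng = random.Random(seed)
    vocab = Counter()
    for t in list(candidates.values()) + [anon]:
        vocab.update(t.lower().split())
    feats = [w for w, _ in vocab.most_common(50)]
    wins = Counter()
    for _ in range(iters):
        subset = rng.sample(feats, max(1, int(len(feats) * frac)))
        best = max(candidates,
                   key=lambda a: cos(vec(anon, subset), vec(candidates[a], subset)))
        wins[best] += 1
    author, count = wins.most_common(1)[0]
    return author if count / iters >= threshold else None

candidates = {
    "A": "the cat sat on the mat and the cat slept on the mat",
    "B": "markets fell sharply today as investors sold bank shares",
}
print(attribute_or_abstain("the cat sat on the mat again", candidates))  # → A
```

The abstention step is what makes the method usable in an open-set setting: a candidate who wins only under some feature views is not trusted.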

  • Research Article
  • Cited by 16
  • 10.2478/seeur-2022-0100
A Survey on Authorship Analysis Tasks and Techniques
  • Dec 1, 2022
  • SEEU Review
  • Arta Misini + 2 more

Authorship Analysis (AA) is a natural language processing field that examines the previous works of writers to identify the author of a text based on its features. Studies in authorship analysis include authorship identification, authorship profiling, and authorship verification. Due to its relevance to many applications, this field has received considerable attention. It is widely used in the attribution of historical literature. Other applications include legal linguistics, criminal law, forensic investigations, and computer forensics. This paper aims to provide an overview of the work done and the techniques applied in the authorship analysis domain. The examination of recent developments in this field is the principal focus. Many different criteria can be used to define a writer's style. This paper investigates stylometric features in different author-related tasks, including lexical, syntactic, semantic, structural, and content-specific ones. Many classification methods have been applied to authorship analysis tasks. We examine research studies that use different machine learning and deep learning techniques. As a means of pointing the direction for future studies, we present the most relevant methods recently proposed. The reviewed studies include documents of different types and different languages. In summary, because each natural language has its own set of features, there is no standard technique generically applicable for solving the AA problem.
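The feature families the survey lists (lexical, structural, and so on) usually reduce to simple per-document statistics. A minimal illustrative extractor (not taken from the survey; the feature names are invented, and non-empty input is assumed):

```python
import re

def stylometric_features(text):
    """A few common lexical/structural stylometric statistics.
    Uses Unicode-aware word matching, so it also works for non-Latin
    scripts such as Urdu. Assumes `text` contains at least one word
    and one sentence terminator."""
    words = re.findall(r"[^\W\d_]+", text)          # Unicode letter runs
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(map(len, words)) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "comma_rate": text.count(",") / len(words),
    }

print(stylometric_features("One fish, two fish. Red fish!"))
```

Feature dictionaries like this one are what the classifiers surveyed here (SVMs, gradient boosting, neural networks) consume as input vectors.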

  • Research Article
  • 10.7592/tertium.2024.9.1.295
Authorship Analysis, Social Networks and Catalan
  • Nov 15, 2024
  • Półrocznik Językoznawczy Tertium
  • Elga Cremades

This paper presents data that can help determine whether factors such as gender or age can become significant in authorship analysis through X in Catalan. Considering the principles of forensic linguistics (in particular, authorship analysis and idiolectal style), 500 publications have been analyzed from a stylistic point of view, focusing on three discursive aspects: specific features of X, pragmatic variables, and stylistic variables. Contrary to what some authors have found for English X users (Cicres, 2015), the paper shows that, in Catalan, emoticons, exclamations, or letter multiplication are not distinctive features for gender or age. However, elements such as the concatenation of hashtags, the use of links, the intensification of first-person subject pronouns, the use of capital letters, or the use of suspension points can be meaningful for age, but not for gender. This paper thus constitutes a first step towards finding truly distinctive elements in the use of X in Catalan, even though more studies, with larger corpora, need to be done to confirm these tendencies.

  • Research Article
  • Cited by 23
  • 10.1103/physrevd.98.076017
Deep learning for R-parity violating supersymmetry searches at the LHC
  • Oct 30, 2018
  • Physical Review D
  • Jun Guo + 4 more

Supersymmetry with hadronic R-parity violation in which the lightest neutralino decays into three quarks is still weakly constrained. This work aims to further improve the current search for this scenario by the boosted decision tree method with additional information from jet substructure. In particular, we find a deep neural network turns out to perform well in characterizing the neutralino jet substructure. We first construct a Convolutional Neural Network (CNN) which is capable of tagging the neutralino jet in any signal process by using the idea of the jet image. When applied to pure jet samples, such a CNN outperforms the N-subjettiness variable by a factor of a few in tagging efficiency. Moreover, we find the method, which combines the CNN output and jet invariant mass, can perform better and is applicable to a wider range of neutralino mass than the CNN alone. Finally, the ATLAS search for the signal of gluino pair production with subsequent decay $\tilde{g} \to q q \tilde{\chi}^0_1 (\to q q q)$ is recast as an application. In contrast to the pure sample, the heavy contamination among jets in this complex final state renders the discriminating powers of the CNN and N-subjettiness similar. By analyzing the jet substructure in events which pass the ATLAS cuts with our CNN method, the exclusion limit on gluino mass can be pushed up by $\sim200$ GeV for neutralino mass $\sim 100$ GeV.

  • Conference Article
  • 10.1109/esci48226.2020.9167546
NAD: Neuron Activation based Divergence Maps for Weakly Supervised Object Localization
  • Mar 1, 2020
  • Siddhant Bagga + 4 more

Convolutional neural networks (CNNs) have brought about massive improvements in the field of computer vision, solving some of the most complex problems like object detection, image captioning, semantic segmentation, etc. These networks perform very well for such tasks, but very little is known about why they do so. Their lack of transparency makes them difficult to interpret, which is why they are considered black boxes. In this paper, we have proposed an approach in which we carry out weakly supervised object localization in images, which eventually helps us understand the functioning of CNNs by providing visual explanations for their predictions. The proposed work focuses on exploiting the learned feature dependencies between consecutive layers of a CNN. Different strategies are employed for different types of layers (Fully Connected layer, Convolutional layer, etc.) to compute a binary value signifying neuron relevance. Moreover, we employ a method in which the computed activation maps corresponding to the non-target class are discounted from those of the target class in order to eliminate the irrelevant neurons and amplify the most discriminative neurons. This process highlights the most significant neurons of the CNN, which have contributed the most to the prediction of a particular object. Our proposed approach performs better than the previously developed techniques, with higher accuracy.

  • Research Article
  • Cited by 1
  • 10.62527/joiv.9.2.2687
Detection of Oil Palm Fruit Ripeness through Image Feature Optimization using Convolutional Neural Network Algorithm
  • Mar 31, 2025
  • JOIV : International Journal on Informatics Visualization
  • Dedy Setiawan + 2 more

Demand for palm oil as a raw material for food and non-food products is growing in Indonesia and other countries; oil palm farmers in Indonesia must therefore maximize their production. Currently, farmers still have difficulty determining the maturity level of oil palm fruit, which is essential to maintaining their production. This research was conducted to identify the maturity level of oil palm fruit from images in a way that is practical for oil palm farmers in Indonesia. The Convolutional Neural Network (CNN) algorithm is the research method used to identify pictures of oil palm fruit. The dataset comprised 400 images of oil palm fruits divided into three classes, namely images of raw, ripe, and rotten oil palm fruits. The dataset was taken from various internet sources, and photos were taken directly using a mobile phone camera according to a predetermined class. This study found that identifying the maturity level of oil palm fruit using the CNN algorithm obtained a high accuracy of 98% in the training process and 76% in the model testing process. The findings of this study can also inspire further research in optimizing image features and using the CNN algorithm more efficiently. This could include a reduction in model training time or the number of parameters, or the development of other techniques that improve algorithm performance.

  • Research Article
  • Cited by 1
  • 10.31577/cai_2021_2_318
Clustering and Bootstrapping Based Framework for News Knowledge Base Completion
  • Jan 1, 2021
  • Computing and Informatics
  • K Srinivasa + 1 more

Extracting the facts, namely entities and relations, from unstructured sources is an essential step in any knowledge base construction. At the same time, it is also necessary to ensure the completeness of the knowledge base by incrementally extracting new facts from various sources. To date, knowledge base completion has been studied as a problem of knowledge refinement, where the missing facts are inferred by reasoning about the information already present in the knowledge base. However, facts missed while extracting the information from multilingual sources are ignored. Hence, this work proposes a generic framework for knowledge base completion to enrich a knowledge base of crime-related facts extracted from online news articles in the English language with facts extracted from news articles in the low-resource Indian language Hindi. Using the framework, information can be extracted from news articles in any low-resource language without language-specific tools such as POS taggers, by using an appropriate machine translation tool. To achieve this, a clustering algorithm is proposed which exploits the redundancy among the bilingual collection of news articles by representing the clusters with knowledge base facts, unlike the existing Bag of Words representation. From each cluster, the facts extracted from English language articles are bootstrapped to extract the facts from comparable Hindi language articles. This way of bootstrapping within the cluster helps to identify the sentences from a low-resource language that are enriched with new information related to the facts extracted from a high-resource language like English. The empirical results show that the proposed clustering algorithm produced accurate, high-quality clusters for both monolingual and cross-lingual facts. Experiments also proved that the proposed framework achieves a high recall rate in extracting new facts from Hindi news articles.
