Authorship Attribution for a Resource Poor Language—Urdu

Abstract

Authorship attribution refers to examining the writing style of authors to identify, from a given set of potential authors, the most likely author of a document. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of fewer than 20 candidate authors, which is far from real-world settings; their findings may therefore not carry over to such settings. To that end, we have made three key contributions. First, we have developed a large authorship attribution corpus for Urdu, a low-resource language. The corpus comprises over 2.6 million tokens and 21,938 news articles by 94 authors, making it a closer approximation of real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features applicable to the Urdu language and have developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our corpus is many times larger than the existing corpora and more challenging than its counterparts for the authorship attribution task, and (b) Convolutional Neural Networks are the most effective technique, achieving a nearly perfect F1 score of 0.989 on an existing corpus and 0.910 on our newly developed corpus.
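As an illustration of the attribution task the abstract describes, a minimal nearest-profile baseline over character n-grams can be sketched as below. This is not the paper's CNN or feature set; the toy author texts and function names are invented for the example.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    common = set(p) & set(q)
    dot = sum(p[g] * q[g] for g in common)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(unknown, candidates, n=3):
    """Return the candidate author whose training text is most similar."""
    profile = char_ngrams(unknown, n)
    return max(candidates, key=lambda a: cosine(profile, char_ngrams(candidates[a], n)))

authors = {
    "A": "the cat sat on the mat and the cat purred softly",
    "B": "stocks rallied today as markets digested the report",
}
print(attribute("the cat napped on the mat", authors))  # → A
```

Character n-grams need no tokenizer or POS tagger, which is one reason they are popular for low-resource languages such as Urdu.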

Similar Papers
  • Research Article
  • Cited by 4
  • 10.1371/journal.pone.0310057
Automatic authorship attribution in Albanian texts.
  • Oct 22, 2024
  • PloS one
  • Arta Misini + 3 more

Automatic authorship identification is a challenging task that has been the focus of extensive research in natural language processing. Regardless of the progress made in attributing authorship, the lack of corpora in under-resourced languages impedes the advancement and evaluation of existing methods. To address this gap, we investigate the problem of authorship attribution in Albanian. We introduce a newly compiled corpus of Albanian newsroom columns and literary works and analyze machine-learning methods for detecting authorship. We create a set of hand-crafted features targeting various categories (lexical, morphological, and structural) relevant to Albanian and experiment with multiple classifiers using two different multiclass classification strategies. Furthermore, we compare our results to those obtained using deep learning models. Our investigation focuses on identifying the best combination of features and classification methods. The results reveal that lexical features are the most effective set of linguistic features, significantly improving the performance of various algorithms in the authorship attribution task. Among the machine learning algorithms evaluated, XGBoost demonstrated the best overall performance, achieving an F1 score of 0.982 on literary works and 0.905 on newsroom columns. Additionally, deep learning models such as fastText and BERT-multilingual showed promising results, highlighting their potential applicability in specific scenarios in Albanian writings. These findings contribute to the understanding of effective methods for authorship attribution in low-resource languages and provide a robust framework for future research in this area. The careful analysis of the different scenarios and the conclusions drawn from the results provide valuable insights into the potential and limitations of the methods and highlight the challenges in detecting authorship in Albanian.
Promising results are reported, with implications for improving the methods used in Albanian authorship attribution. This study provides a valuable resource for future research and a reference for researchers in this domain.

  • Research Article
  • Cited by 3
  • 10.5281/zenodo.50899
Explaining Delta, or: How do distance measures for authorship attribution work?
  • Jun 5, 2015
  • Computational Linguistics
  • Stefan Evert + 5 more

Authorship Attribution is a research area in quantitative text analysis concerned with attributing texts of unknown or disputed authorship to their actual author based on quantitatively measured linguistic evidence (see Juola 2006; Stamatatos 2009; Koppel et al. 2009). Authorship attribution has applications in literary studies, history, forensics and many other fields, e.g. corpus stylistics (Oakes 2009). The fundamental assumption in authorship attribution is that individuals have idiosyncratic habits of language use, leading to a stylistic similarity of texts written by the same person. Many of these stylistic habits can be measured by assessing the relative frequencies of function words or parts of speech, vocabulary richness, and many other linguistic features. Distance metrics between the resulting feature vectors indicate the overall similarity of texts to each other, and can be used for attributing a text of unknown authorship to the most similar of a (usually closed) set of candidate authors. The aim of this paper is to present findings from a larger investigation of authorship attribution methods which centres around the following questions: (a) How and why exactly does authorship attribution based on distance measures work? (b) Why do different distance measures and normalization strategies perform differently? (c) Specifically, why do they perform differently for different languages and language families, and (d) How can such knowledge be used to improve authorship attribution methods? First, we describe current issues in authorship attribution and contextualize our own work. Second, we report some of our earlier research into the question. Then, we present our most recent investigation, which pertains to the effects of normalization methods and distance measures in different languages, describing our aims, data and methods. We conclude with a summary of our results.
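The classic Burrows's Delta that this paper dissects is simple to state: z-score the relative frequencies of the most frequent words against the corpus, then average the absolute differences between two documents. A minimal sketch follows, assuming whitespace tokenization; the toy corpus is invented for the example.

```python
import statistics
from collections import Counter

def rel_freqs(text):
    """Relative word frequencies, whitespace-tokenized."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def burrows_delta(corpus, doc_a, doc_b, top_n=20):
    """Mean absolute difference of z-scored relative frequencies over
    the top_n most frequent words of the corpus (Burrows's Delta)."""
    profiles = [rel_freqs(t) for t in corpus]
    totals = Counter()
    for p in profiles:
        totals.update(p)
    features = [w for w, _ in totals.most_common(top_n)]
    mu = {w: statistics.mean(p.get(w, 0.0) for p in profiles) for w in features}
    # guard against zero spread so the division below is always defined
    sd = {w: statistics.pstdev([p.get(w, 0.0) for p in profiles]) or 1.0 for w in features}
    fa, fb = rel_freqs(doc_a), rel_freqs(doc_b)
    z = lambda f, w: (f.get(w, 0.0) - mu[w]) / sd[w]
    return statistics.mean(abs(z(fa, w) - z(fb, w)) for w in features)

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks and the fox runs away",
    "a lazy afternoon with the quick brown dog",
]
print(burrows_delta(corpus, corpus[0], corpus[0]))  # identical texts → 0.0
```

The normalization choice (here, z-scores over corpus-wide word frequencies) is exactly the knob whose effect across languages the paper investigates.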

  • Conference Article
  • Cited by 4
  • 10.1109/ijcnn52387.2021.9533619
DAAB: Deep Authorship Attribution in Bengali
  • Jul 18, 2021
  • Atish Kumar Dipongkor + 6 more

Authorship attribution identifies the true author of an unknown document. Authorship attribution plays a crucial role in plagiarism detection and blackmailer identification; however, the existing studies on authorship attribution in Bengali are limited. In this paper, we propose an instance-based deep authorship attribution model, called DAAB, to identify authors in Bengali. Our DAAB model fuses features from convolutional neural networks with another set of features from an artificial neural network to learn the stylometry of an author for authorship attribution. Extensive experiments with three real benchmark datasets, namely Bengali-Quora and two online Bengali corpora, demonstrate the superiority of our authorship attribution model.

  • Research Article
  • Cited by 155
  • 10.1093/llc/fqq013
The effect of author set size and data size in authorship attribution
  • Aug 16, 2010
  • Literary and Linguistic Computing
  • K Luyckx + 1 more

Applications of authorship attribution 'in the wild' (Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advance Access published January 12, 2010: 10.1007/s10579-009-9111-2), for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in supervised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per candidate author (i.e. the size of the training data). We also investigate the robustness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning for discriminating between the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and the size of training data decreases, although the machine learning approach continues performing significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.

  • Research Article
  • Cited by 16
  • 10.1109/access.2018.2869198
An Effective and Scalable Framework for Authorship Attribution Query Processing
  • Jan 1, 2018
  • IEEE Access
  • Raheem Sarwar + 8 more

Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have a small number of text samples, e.g., 5–10 texts/author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long text focus on application scenarios where the number of candidate authors is limited to 50. These solutions generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document, which transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against the state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3000 novels by 500 authors. Experimental results from this paper show that our method significantly outperforms all competitors. Specifically, for both the closed-set and open-set authorship attribution problems, our method has achieved higher than 95% accuracy.

  • Research Article
  • Cited by 10
  • 10.32604/cmc.2022.025543
Deep Learning and Machine Learning-Based Model for Conversational Sentiment Classification
  • Jan 1, 2022
  • Computers, Materials & Continua
  • Sami Ullah + 4 more

In the current era of the internet, people use online media for conversation, discussion, chatting, and other similar purposes. Analysis of such material, where more than one person is involved, poses a distinct challenge compared to other text analysis tasks. There are several approaches to identifying users' emotions from conversational text in the English language; however, regional or low-resource languages have been neglected. The Urdu language is one of them and, despite being used by millions of users across the globe, to the best of our knowledge there exists no work on dialogue analysis in the Urdu language. Therefore, in this paper, we have proposed a model which utilizes deep learning and machine learning approaches for the classification of users' emotions from text. To accomplish this task, we first created a dataset for the Urdu language with the help of existing English language datasets for dialogue analysis. After that, we preprocessed the data and selected dialogues with common emotions. Once the dataset was prepared, we used different deep learning and machine learning techniques for the classification of emotion, tuning the algorithms to the Urdu language datasets. The experimental evaluation has shown encouraging results, with 67% accuracy on the Urdu dialogue dataset; more than 10,000 dialogues are classified into five emotions, i.e., joy, fear, anger, sadness, and neutral. We believe that this is the first effort for emotion detection from conversational text in the Urdu language domain.

  • Research Article
  • Cited by 14
  • 10.1016/j.procs.2015.04.110
Influence of Lexical, Syntactic and Structural Features and their Combination on Authorship Attribution for Telugu Text
  • Jan 1, 2015
  • Procedia Computer Science
  • S Naga Prasad + 3 more

  • Research Article
  • Cited by 319
  • 10.1007/s10579-009-9111-2
Authorship attribution in the wild
  • Jan 13, 2010
  • Language Resources and Evaluation
  • Moshe Koppel + 2 more

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
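The randomized-feature-set idea in this abstract can be sketched as follows: repeatedly find the nearest candidate under a random subset of features, and attribute only if one candidate wins a large fraction of the rounds, abstaining otherwise (since the true author may be absent). All parameter names and toy texts below are invented for illustration; this is not the paper's exact procedure.

```python
import math
import random
from collections import Counter

def vec(text, feats):
    """Count vector of `text` restricted to the given feature words."""
    c = Counter(text.lower().split())
    return [c[w] for w in feats]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute_or_abstain(anon, candidates, iters=200, frac=0.5, threshold=0.8, seed=0):
    """Attribute `anon` only if one candidate is nearest under at least
    `threshold` of the random feature subsets; otherwise return None."""
    rng = random.Random(seed)
    vocab = Counter()
    for t in list(candidates.values()) + [anon]:
        vocab.update(t.lower().split())
    feats = [w for w, _ in vocab.most_common(50)]
    wins = Counter()
    for _ in range(iters):
        subset = rng.sample(feats, max(1, int(len(feats) * frac)))
        best = max(candidates,
                   key=lambda a: cos(vec(anon, subset), vec(candidates[a], subset)))
        wins[best] += 1
    author, count = wins.most_common(1)[0]
    return author if count / iters >= threshold else None

candidates = {
    "A": "the cat sat on the mat and the cat slept on the mat",
    "B": "markets fell sharply today as investors sold bank shares",
}
print(attribute_or_abstain("the cat sat on the mat again", candidates))  # → A
```

The abstention step is what makes the method usable in an open-set setting: a candidate who wins only under some feature views is not trusted.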

  • Research Article
  • Cited by 16
  • 10.2478/seeur-2022-0100
A Survey on Authorship Analysis Tasks and Techniques
  • Dec 1, 2022
  • SEEU Review
  • Arta Misini + 2 more

Authorship Analysis (AA) is a natural language processing field that examines the previous works of writers to identify the author of a text based on its features. Studies in authorship analysis include authorship identification, authorship profiling, and authorship verification. Due to its relevance to many applications, this field has received considerable attention. It is widely used in the attribution of historical literature. Other applications include legal linguistics, criminal law, forensic investigations, and computer forensics. This paper aims to provide an overview of the work done and the techniques applied in the authorship analysis domain. The examination of recent developments in this field is the principal focus. Many different criteria can be used to define a writer's style. This paper investigates stylometric features in different author-related tasks, including lexical, syntactic, semantic, structural, and content-specific ones. Many classification methods have been applied to authorship analysis tasks. We examine research studies that use different machine learning and deep learning techniques. As a means of pointing the direction for future studies, we present the most relevant methods recently proposed. The reviewed studies include documents of different types and different languages. In summary, because each natural language has its own set of features, there is no standard technique generically applicable for solving the AA problem.
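The feature families the survey lists (lexical, structural, and so on) usually reduce to simple per-document statistics. A minimal illustrative extractor (not taken from the survey; the feature names are invented, and non-empty input is assumed):

```python
import re

def stylometric_features(text):
    """A few common lexical/structural stylometric statistics.
    Uses Unicode-aware word matching, so it also works for non-Latin
    scripts such as Urdu. Assumes `text` contains at least one word
    and one sentence terminator."""
    words = re.findall(r"[^\W\d_]+", text)          # Unicode letter runs
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(map(len, words)) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "comma_rate": text.count(",") / len(words),
    }

print(stylometric_features("One fish, two fish. Red fish!"))
```

Feature dictionaries like this one are what the classifiers surveyed here (SVMs, gradient boosting, neural networks) consume as input vectors.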

  • Research Article
  • 10.7592/tertium.2024.9.1.295
Authorship Analysis, Social Networks and Catalan
  • Nov 15, 2024
  • Półrocznik Językoznawczy Tertium
  • Elga Cremades

This paper presents data that can help determine whether factors such as gender or age can become significant in authorship analysis through X in Catalan. Considering the principles of forensic linguistics (in particular, authorship analysis and idiolectal style), 500 publications have been analyzed from a stylistic point of view, focusing on three discursive aspects: specific features of X, pragmatic variables, and stylistic variables. Contrary to what some authors have found for English X users (Cicres, 2015), the paper shows that, in Catalan, emoticons, exclamations, or letter multiplication are not distinctive features for gender or age. However, elements such as the concatenation of hashtags, the use of links, the intensification of first-person subject pronouns, the use of capital letters, or the use of suspension points can be meaningful for age, but not for gender. This paper thus constitutes a first step towards finding truly distinctive elements in the use of X in Catalan, even though more studies, with larger corpora, need to be done to confirm these tendencies.

  • Research Article
  • Cited by 23
  • 10.1103/physrevd.98.076017
Deep learning for R-parity violating supersymmetry searches at the LHC
  • Oct 30, 2018
  • Physical Review D
  • Jun Guo + 4 more

Supersymmetry with hadronic R-parity violation in which the lightest neutralino decays into three quarks is still weakly constrained. This work aims to further improve the current search for this scenario by the boosted decision tree method with additional information from jet substructure. In particular, we find a deep neural network turns out to perform well in characterizing the neutralino jet substructure. We first construct a Convolutional Neural Network (CNN) which is capable of tagging the neutralino jet in any signal process by using the idea of the jet image. When applied to pure jet samples, such a CNN outperforms the N-subjettiness variable by a factor of a few in tagging efficiency. Moreover, we find the method, which combines the CNN output and jet invariant mass, can perform better and is applicable to a wider range of neutralino mass than the CNN alone. Finally, the ATLAS search for the signal of gluino pair production with subsequent decay $\tilde{g} \to q q \tilde{\chi}^0_1 (\to q q q)$ is recast as an application. In contrast to the pure sample, the heavy contamination among jets in this complex final state renders the discriminating powers of the CNN and N-subjettiness similar. By analyzing the jet substructure in events which pass the ATLAS cuts with our CNN method, the exclusion limit on gluino mass can be pushed up by $\sim200$ GeV for neutralino mass $\sim 100$ GeV.

  • Conference Article
  • 10.1109/esci48226.2020.9167546
NAD: Neuron Activation based Divergence Maps for Weakly Supervised Object Localization
  • Mar 1, 2020
  • Siddhant Bagga + 4 more

Convolutional neural networks (CNNs) have brought about massive improvements in the field of computer vision, solving some of the most complex problems like object detection, image captioning, semantic segmentation, etc. These networks perform very well for such tasks, but very little is known about why they do so. Their lack of transparency makes them difficult to interpret, which is why they are considered black boxes. In this paper, we have proposed an approach in which we carry out weakly supervised object localization in images, which eventually helps us understand the functioning of CNNs by providing visual explanations for their predictions. The proposed work focuses on exploiting the learned feature dependencies between consecutive layers of a CNN. Different strategies are employed for different types of layers (Fully Connected layer, Convolutional layer, etc.) to compute a binary value signifying neuron relevance. Moreover, we employ a method in which the computed activation maps corresponding to the non-target class are discounted from those of the target class in order to eliminate the irrelevant neurons and amplify the most discriminative neurons. This process highlights the most significant neurons of the CNN, which have contributed the most to the prediction of a particular object. Our proposed approach performs better than the previously developed techniques, with higher accuracy.

  • Research Article
  • Cited by 1
  • 10.62527/joiv.9.2.2687
Detection of Oil Palm Fruit Ripeness through Image Feature Optimization using Convolutional Neural Network Algorithm
  • Mar 31, 2025
  • JOIV : International Journal on Informatics Visualization
  • Dedy Setiawan + 2 more

Demand for palm oil as a raw material for food and non-food products is growing in Indonesia and other countries; oil palm farmers in Indonesia must therefore maximize their production. Currently, farmers still have difficulty determining the maturity level of oil palm fruit, which is essential to maintaining their production. This research was conducted to identify the maturity level of oil palm fruit from images in a way that is practical for oil palm farmers in Indonesia. The Convolutional Neural Network (CNN) algorithm is the research method used to identify pictures of oil palm fruit. The dataset comprised 400 images of oil palm fruits divided into three classes, namely images of raw, ripe, and rotten oil palm fruits. The dataset was taken from various internet sources, and photos were taken directly using a mobile phone camera according to a predetermined class. This study found that identifying the maturity level of oil palm fruit using the CNN algorithm obtained a high accuracy of 98% in the training process and 76% in the model testing process. The findings of this study can also inspire further research in optimizing image features and using the CNN algorithm more efficiently. This could include a reduction in model training time or the number of parameters, or the development of other techniques that improve algorithm performance.

  • Research Article
  • Cited by 1
  • 10.31577/cai_2021_2_318
Clustering and Bootstrapping Based Framework for News Knowledge Base Completion
  • Jan 1, 2021
  • Computing and Informatics
  • K Srinivasa + 1 more

Extracting the facts, namely entities and relations, from unstructured sources is an essential step in any knowledge base construction. At the same time, it is also necessary to ensure the completeness of the knowledge base by incrementally extracting new facts from various sources. To date, knowledge base completion has been studied as a problem of knowledge refinement, where the missing facts are inferred by reasoning about the information already present in the knowledge base. However, facts missed while extracting the information from multilingual sources are ignored. Hence, this work proposes a generic framework for knowledge base completion to enrich a knowledge base of crime-related facts extracted from online news articles in the English language with facts extracted from news articles in the low-resource Indian language Hindi. Using the framework, information can be extracted from news articles in any low-resource language without language-specific tools such as POS taggers, by using an appropriate machine translation tool. To achieve this, a clustering algorithm is proposed which exploits the redundancy among the bilingual collection of news articles by representing the clusters with knowledge base facts, unlike the existing Bag of Words representation. From each cluster, the facts extracted from English language articles are bootstrapped to extract the facts from comparable Hindi language articles. This way of bootstrapping within the cluster helps to identify the sentences from a low-resource language that are enriched with new information related to the facts extracted from a high-resource language like English. The empirical results show that the proposed clustering algorithm produced accurate, high-quality clusters for both monolingual and cross-lingual facts. Experiments also proved that the proposed framework achieves a high recall rate in extracting new facts from Hindi news articles.
