Towards Authorship Attribution in Arabic Short-Microblog Text

Kamal Mansour Jambi, Muazzam Ahmed Siddiqui, Salma Omar Alhaj, Imtiaz Hussain Khan

doi:10.1109/access.2021.3112624

Open Access

https://doi.org/10.1109/access.2021.3112624

Abstract

Authorship attribution is the task of identifying individuals by their writing style without knowing their actual identities. It is a challenging task in natural language processing. Most work on authorship attribution has focused on English, whereas the problem is understudied in Arabic. Moreover, because of the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explores the possibility of using three state-of-the-art classifiers, Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in Arabic short-microblog text. We employed three commonly used types of linguistic features (character-based, lexical and syntactic) in an incremental manner to evaluate the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, classification accuracy was inversely related to the number of authors. Overall, the SVM and Random Forest classifiers are comparable and attained ~65% accuracy, whereas KNN attained only ~35% accuracy. In addition, lexical features offer more discriminatory power than character and syntactic features.
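The incremental feature scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature definitions (character trigrams, whitespace tokens), the cosine-similarity nearest-neighbour classifier, the function names and the toy posts are all assumptions made for illustration. KNN is used only because it fits in a few lines of standard-library Python, and syntactic features are left out because they would need a part-of-speech tagger.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character-based features: counts of overlapping character n-grams."""
    return Counter(f"c:{text[i:i + n]}" for i in range(len(text) - n + 1))

def lexical_features(text):
    """Lexical features: counts of whitespace-separated tokens."""
    return Counter(f"w:{tok}" for tok in text.split())

def extract(text, groups):
    """Merge the selected feature groups into one sparse count vector."""
    features = Counter()
    for group in groups:
        features.update(group(text))
    return features

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, query, groups, k=1):
    """Majority vote among the k training posts most similar to the query."""
    q = extract(query, groups)
    scored = sorted(
        ((cosine(extract(text, groups), q), author) for text, author in train),
        reverse=True,
    )
    votes = Counter(author for _, author in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy posts and author labels; not data from the paper.
TRAIN = [
    ("صباح الخير يا أصدقاء", "author_a"),
    ("صباح الخير والسعادة للجميع", "author_a"),
    ("المباراة اليوم كانت رائعة", "author_b"),
    ("فريقنا فاز في المباراة", "author_b"),
]

# Add feature groups one at a time, mirroring the incremental setup.
for groups in ([char_ngrams], [char_ngrams, lexical_features]):
    names = " + ".join(g.__name__ for g in groups)
    print(names, "->", knn_predict(TRAIN, "صباح الخير", groups))
```

In a fuller experiment, the same loop would wrap a train/test split over many authors and report accuracy for each feature combination and each classifier.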

Highlights

Authorship attribution is the task of determining the author of a document or text from characteristics of the text, inferred from features extracted with various feature engineering and feature selection techniques
The results indicate that Support Vector Machines (SVM) attained the maximum accuracy
Existing work on Arabic authorship attribution has predominantly focused on longer texts such as books, scientific papers or essays

Summary

Introduction

Authorship attribution is the task of determining the author of a document or text from characteristics of the text, which are inferred from features extracted with various feature engineering and feature selection techniques. It has become a major problem because the volume of anonymous information has grown drastically with the accelerated growth of Internet usage worldwide. Internet technology has become more accessible and facilitates communication of many kinds. Social networks have changed how users communicate: each user can create an account and post messages on an individual profile. Many of these accounts are anonymous, which creates a need to identify the author’s true identity.
