Abstract

Authorship attribution is the task of identifying individuals by their writing style without knowing their actual identities, and it is a challenging problem in natural language processing. Most work on authorship attribution has focused on English, whereas the problem is understudied for Arabic. Moreover, because of the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explores the use of state-of-the-art classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in short Arabic microblog text. We employed three commonly used types of linguistic features (character-based, lexical and syntactic) in an incremental manner and evaluated the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, classification accuracy was inversely related to the number of authors. Overall, the SVM and Random Forest classifiers were comparable and attained ~65% accuracy, whereas KNN reached only ~35%. In addition, lexical features offered more discriminatory power than character and syntactic features.
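The incremental use of feature sets described above can be illustrated with a minimal sketch. It is not the paper's implementation: the feature definitions (character trigrams, whitespace-separated word counts), the toy texts, and the function names `extract` and `predict_author` are assumptions made for illustration. Syntactic features are omitted, and a simple 1-nearest-neighbour rule with cosine similarity stands in for the SVM, KNN and Random Forest classifiers.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Character-based features: counts of overlapping character n-grams."""
    return Counter(f"c:{text[i:i + n]}" for i in range(len(text) - n + 1))


def lexical_features(text):
    """Lexical features: word counts (whitespace tokenization is an assumption)."""
    return Counter(f"w:{token}" for token in text.split())


def extract(text, use_lexical=False):
    """Start from character features and optionally add lexical ones."""
    features = char_ngrams(text)
    if use_lexical:
        features.update(lexical_features(text))
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_author(train, text, use_lexical=False):
    """Return the author of the most similar training text (1-nearest neighbour)."""
    query = extract(text, use_lexical)
    best_text, best_author = max(
        train, key=lambda pair: cosine(extract(pair[0], use_lexical), query)
    )
    return best_author


# Hypothetical toy data: (short text, author label) pairs, not from the paper.
TRAIN = [
    ("صباح الخير يا أصدقاء", "author_a"),
    ("صباح الخير للجميع", "author_a"),
    ("مباراة اليوم كانت رائعة", "author_b"),
    ("مباراة قوية اليوم", "author_b"),
]

if __name__ == "__main__":
    print(predict_author(TRAIN, "صباح الخير"))
    print(predict_author(TRAIN, "مباراة اليوم", use_lexical=True))
```

Prefixing feature names with `c:` and `w:` keeps the character and lexical feature spaces separate when they are merged, so each added feature set extends the vector rather than overwriting earlier counts.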

Highlights

  • Authorship attribution is the task of determining the author of a document or text from characteristics inferred from features of that text, which are obtained with various feature engineering and feature selection techniques

  • The results indicate that Support Vector Machines (SVM) attained the maximum accuracy

  • Existing work on Arabic authorship attribution has predominantly focused on longer texts such as books, scientific papers or essays


Summary

Introduction

Authorship attribution is the task of determining the author of a document or text from characteristics inferred from features of that text, which are obtained with various feature engineering and feature selection techniques. It has become a major problem because the volume of anonymous information has increased drastically with the accelerated growth of Internet usage worldwide. Internet technology facilitates communication of all kinds and has become more accessible. Social networks have reshaped the way users communicate: each user can create an account and post messages on an individual profile. Many of these accounts are anonymous, so the author’s true identity may need to be determined.

