Abstract

Authorship attribution (AA) is a subfield of authorship analysis, and its importance has grown as the volume of anonymous information has increased with the rapid growth of internet usage worldwide. The problem has been well studied for languages such as English, Spanish, and Chinese. For Arabic, however, it has received less attention from the research community because of the complexity and nature of Arabic sentences. This paper presents an intensive review of previous AA studies for the Arabic language. Based on that review, the study employs the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) to choose the base classifier for the ensemble methods. For attribution features, hundreds of stylometric features and distinct words were extracted using several tools. The AdaBoost and Bagging ensemble methods were then applied to a dataset of Arabic enquiries (Fatwa). The findings show an improvement in the effectiveness of the authorship attribution task for the Arabic language.
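
To illustrate the TOPSIS step mentioned in the abstract, the following is a minimal sketch of the generic TOPSIS ranking procedure in plain Python. It is not the authors' implementation: the candidate names (`clf_a`, `clf_b`, `clf_c`), the three criteria (accuracy, F1-score, training time), the equal weights, and all numeric values are hypothetical and chosen only to show the mechanics. The sketch assumes every criterion column has at least one non-zero value and that the candidates are not all identical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one TOPSIS closeness score per row of `matrix` (higher is better).

    matrix  -- list of rows, one row per alternative, one column per criterion
    weights -- one weight per criterion
    benefit -- one bool per criterion: True if larger is better, False if smaller is better
    """
    n_criteria = len(weights)

    # Step 1: vector-normalise each column by its Euclidean norm.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_criteria)]

    # Step 2: apply the criterion weights to the normalised values.
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_criteria)]
        for row in matrix
    ]

    # Step 3: build the ideal-best and ideal-worst points, criterion by criterion.
    columns = list(zip(*weighted))
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]

    # Step 4: closeness = distance to worst / (distance to best + distance to worst).
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical candidates: columns are accuracy, F1-score, training time (seconds).
names = ["clf_a", "clf_b", "clf_c"]
candidates = [
    [0.90, 0.90, 1.0],
    [0.80, 0.80, 2.0],
    [0.70, 0.70, 3.0],
]
weights = [1 / 3, 1 / 3, 1 / 3]   # equal weights, an assumption for illustration
benefit = [True, True, False]     # training time is a cost criterion

scores = topsis(candidates, weights, benefit)
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```

In this toy input, `clf_a` is best on every criterion, so it coincides with the ideal-best point and is ranked first; the candidate ranked first would then serve as the base classifier of the ensemble.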

Highlights

  • From a linguistic analysis perspective, authorship attribution (AA) aims to identify the original author of an unseen text

  • We found a large number of methods and approaches that have been developed to tackle the authorship attribution (AA) problem, such as support vector machines (SVM) [18]–[23], naive Bayes (NB) [4], [20], [24], [25], Bayesian classifiers [26], [27], k-nearest neighbor (k-NN) [28], [29], and decision trees [30] (Volume 8, 2020)

  • Feature-based level: to investigate the performance of different stylometric features (ASFMs, DWs, and ASFMs + DWs), Tables 7-14 summarize the results obtained by the two ensemble methods on balanced and imbalanced datasets in terms of accuracy, recall, precision, and F1-score


Summary

Introduction

Authorship attribution (AA) aims to identify the original author of an unseen text. Since the 19th century, several approaches have been proposed to tackle the AA problem. The early approaches had a statistical background [1]–[4], in which the length and frequency of words, characters, and sentences were used to characterize the writing style. These approaches were, in general, human expert-based [5], and their applications covered literary, religious, and legal texts [6]. From the 1960s until the 1990s, both the approaches and the applications shifted to cover new challenging problems such as

