Authorship Attribution of Short Historical Arabic Texts using Stylometric Features and a KNN Classifier with Limited Training Data

Fatma Howedi, Zahra Aborawi Aborawi, Masnizah Mohd, Salah A Jowan

doi:10.3844/jcssp.2020.1334.1345

Abstract

Authorship Attribution (AA) is the task of recognizing the author of an unknown text based on writing style. Among the various approaches to the AA problem, stylometry is a promising one. This paper explores the use of a K-Nearest Neighbor (KNN) classifier combined with stylometric features to perform AA. The study indicates the robustness of KNN in performing AA on short historical Arabic texts written by different authors. To classify the texts by author, KNN was trained with a set of stylometric features including rare words, character counts and character-level 2-, 3- and 4-grams. Feature set sizes ranging from 34 to 2000 were tested in the experiments. The experiments were conducted on limited training data, with datasets consisting of 3 short texts from each author's book. This method proved to be at least as effective as Information Gain (IG) when selecting the most significant n-grams. Moreover, the KNN classifier achieved high accuracy, with a best classification accuracy of up to 90%, except for the 5-KK using character-level 4-grams. This work contributes towards utilizing KNN to identify distinctive stylometric features for robust AA in short historical Arabic texts.
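As a rough illustration of the approach described in the abstract, the sketch below represents each text by the relative frequencies of its character n-grams and assigns an unknown text to the author favored by its k nearest training texts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy English strings and the use of cosine distance are all hypothetical choices made here, and the rare-word, character-count and feature-selection steps are omitted.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Relative frequencies of the overlapping character n-grams in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1  # avoid division by zero for very short texts
    return {g: c / total for g, c in Counter(grams).items()}


def cosine_distance(a, b):
    """Cosine distance between two sparse frequency vectors (dicts).

    Assumption: the source does not state which distance its KNN uses.
    """
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train, query_text, n=3, k=1):
    """Majority vote among the k training texts nearest to query_text.

    train is a list of (author, text) pairs.
    """
    query = char_ngrams(query_text, n)
    ranked = sorted(
        train,
        key=lambda item: cosine_distance(char_ngrams(item[1], n), query),
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data standing in for the short texts of two authors.
train = [
    ("author_a", "the sea the sea the sea is calm"),
    ("author_a", "the sea was calm and the sea was wide"),
    ("author_b", "zzz quick jumps over quirky quartz"),
    ("author_b", "quartz jumps quickly quirky quiz"),
]

predicted = knn_predict(train, "the calm sea", n=3, k=1)
```

With only a few texts per author, as in the setting described above, small values of k are the natural choice, since a larger k quickly pulls in texts by other authors.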

Highlights

Authorship Attribution (AA) is the process of identifying the author of anonymous texts by providing some samples of texts of a few authors as a training set, assuming that the anonymous text is written by one of the authors of the known text samples (Shaker and Corne, 2010; Nirkhi et al., 2014).
A set of experiments was run to evaluate the effect of short Arabic texts with limited training data on different features, in order to show the robustness of K-Nearest Neighbor (KNN) performance.
We investigated the performance of KNN with a small dataset consisting of only 30 short Arabic texts written by 10 authors, as in the case study of Ouamour and Sayoud (2018; 2012).

Summary

Introduction

Authorship Attribution (AA) is the process of identifying the author of anonymous texts by providing some samples of texts of a few authors as a training set, assuming that the anonymous text is written by one of the authors of the known text samples (Shaker and Corne, 2010; Nirkhi et al., 2014). With different data sources such as articles and books, feature sets and classifiers may behave differently in AA (Bozkurt et al., 2007). These differences make AA more challenging than TC. The present paper focuses mainly on the issue of short texts, that is, the limited amount of training data available per author. One of the fundamental sub-problems of AA is the extraction of the most suitable features to represent the writing style of each author. This paper used the K-Nearest Neighbor (KNN) classifier to perform AA by extracting various character n-gram and lexical feature vectors of the writing style per author, as

Objectives

Methods

Results

Conclusion