An Authorship Identification Empirical Evaluation of Writing Style Features in Cross-Topic and Cross-genre Documents

Simisani Ndaba, Edwin Thuma, Gontlafetse Mosweunyane

doi:10.5121/ijaia.2023.14101

Abstract

This paper investigates writing style features that can be used for cross-topic and cross-genre documents in the Authorship Identification task, covering the period from 2003 to 2015. Writing style features previously used for single-topic and single-genre Authorship Identification were empirically evaluated, using an ablation process, to determine whether they are also effective for cross-topic and cross-genre Authorship Identification. The dataset is the English collection of the 2015 PAN CLEF Forum, which consists of 100 sets. The paper also investigates whether combining some of these feature sets improves authorship identification performance. Three classifiers were used: Naïve Bayes, Support Vector Machine, and Random Forest. The results suggest that a combination of lexical, syntactical, structural, and content feature sets can be used effectively for cross-topic and cross-genre authorship identification, as it achieved an AUC of 0.837.
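As an illustration of the evaluation described above, the following is a minimal sketch of a leave-one-out ablation over feature sets, scored by AUC. It is not the authors' implementation: the abstract does not specify the exact ablation procedure, so leave-one-out is an assumption, and the names `auc`, `ablation`, and the `evaluate` callback are hypothetical. In practice `evaluate` would train and test one of the classifiers on the chosen feature sets and return its AUC.

```python
def auc(labels, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ablation(feature_sets, evaluate):
    """Score the full combination, then each combination with one
    feature set left out (assumed leave-one-out ablation).

    feature_sets: list of feature-set names
    evaluate:     hypothetical callback, tuple of names -> AUC
    """
    full = tuple(feature_sets)
    results = {full: evaluate(full)}
    for left_out in feature_sets:
        kept = tuple(f for f in feature_sets if f != left_out)
        results[kept] = evaluate(kept)
    return results
```

Comparing the score of each reduced combination with the score of the full combination indicates how much the left-out feature set contributes.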
