Abstract

Authorship attribution (AA) is the stylometric task of identifying the author of an anonymous or disputed text document. In AA, class-based feature selection schemas, such as Chi-square and the Gini index, have been shown to offer only limited improvement over frequency-based schemas, such as document frequency, common n-grams, and inverted document frequency. The feature selection process in AA is also significantly affected by topic distributions. In this paper, we assess the performance of a global feature selection approach that incorporates the document’s topic category to scale the existing feature weights. In this approach, features that an author uses across different topics indicate higher relevance for the author and thus receive higher weights. Features with biased topic distributions, on the other hand, are assumed to have high topic relevance and receive lower weights. The global topic measure and the author specific topic measure are combined to scale the existing selection weights of the features. Ten-fold cross-validation experiments on a multi-topic dataset with a random topic distribution indicate that our approach significantly improves the performance of the Chi-square, modified Gini index, and common n-grams schemas in the best performing configurations of the classifiers.
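The idea of scaling an existing selection weight by topic measures can be illustrated with a small sketch. The abstract does not give the formulas for the global or author specific topic measures, so the code below uses normalized topic entropy as a hypothetical stand-in for both and multiplies them into the base weight. The names `topic_spread` and `scale_weight`, the toy counts, and the multiplicative combination are all assumptions made for illustration, not the paper's definitions.

```python
import math

def topic_spread(topic_counts, n_topics):
    """Normalized entropy of a feature's topic distribution, in [0, 1].

    Hypothetical stand-in for a topic measure: 1.0 means the feature is
    spread evenly over all topics, 0.0 means it occurs in a single topic.
    """
    total = sum(topic_counts.values())
    if total == 0 or n_topics < 2:
        return 0.0
    entropy = sum((c / total) * math.log(total / c)
                  for c in topic_counts.values() if c > 0)
    return entropy / math.log(n_topics)

def scale_weight(base_weight, global_counts, author_counts, n_topics):
    """Scale an existing selection weight (e.g. a Chi-square score).

    Assumed combination: the product of a corpus-wide spread and the
    spread within one author's documents.
    """
    global_part = topic_spread(global_counts, n_topics)
    author_part = topic_spread(author_counts, n_topics)
    return base_weight * global_part * author_part

# Toy occurrence counts per topic (hypothetical data).
even = {"sports": 4, "politics": 4, "travel": 4}   # topic-neutral feature
biased = {"sports": 12}                            # topic-bound feature

neutral_weight = scale_weight(2.0, even, even, n_topics=3)
topical_weight = scale_weight(2.0, biased, biased, n_topics=3)
```

Under these assumptions, a feature used evenly across topics keeps its base weight, while a feature confined to one topic is scaled down to zero. A softer combination, such as a weighted average of the two measures, would be an equally plausible design choice.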

Highlights

  • The task of authorship attribution (AA) is the identification of the author of a disputed/unknown text document

  • Function words, a well-known feature set in AA, have high document frequencies; when the inverted document frequency (IDF) selection schema is applied to arbitrary words, most function words get low scores and are eliminated

  • Modern feature selection schemas for text classification have been tested on content-dependent tasks, where the document content and target label are directly related
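The highlight about IDF and function words can be checked on a toy corpus. The sketch below is a minimal illustration that assumes the common definition IDF = log(N / df); the three documents and the helper name `idf_scores` are hypothetical and are not taken from the paper's dataset.

```python
import math

# Toy corpus (hypothetical): three short documents on different topics.
docs = [
    "the ship reached the harbor and the crew rested",
    "the market fell and the traders left the floor",
    "the striker scored and the crowd cheered",
]

def idf_scores(documents):
    """Return IDF = log(N / df) for every word in the corpus."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # Count each word at most once per document.
        for word in set(text.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

scores = idf_scores(docs)

# Keep only words with a positive IDF score, as a simple selection rule.
selected = {w for w, s in scores.items() if s > 0}
```

Here the function words "the" and "and" occur in every document, so their IDF is zero and they drop out of `selected`, while topic-bound words such as "harbor" and "market" receive the highest scores. This is the behavior the highlight warns about for AA.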

Summary

INTRODUCTION

The task of authorship attribution (AA) is the identification of the author of a disputed/unknown text document. Feature sets suggested for exploiting the stylometric properties of the authors are generally assumed to be topic independent, and they encode little or no information about the content of the document. In recent studies, these feature sets include vocabulary richness, readability measures, character n-grams, terms, and function words [5]–[8]. Odds ratio and chi-square have been compared on datasets with very few authors. According to these comparisons, simple document frequency (DF) based term selection has been reported to be quite competitive with other feature selection methods [9], [10].

FEATURE SELECTION METHODS
INVERTED DOCUMENT FREQUENCY
CHI-SQUARE
AUTHOR SPECIFIC TOPIC MEASURE
EXPERIMENTS
EVALUATION
RESULTS
Findings
CONCLUSIONS
