Comparative evaluation of term selection functions for authorship attribution

Jacques Savoy

doi:10.1093/llc/fqt047

Abstract

Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. From this perspective, the first step is to determine the most effective features (words, punctuation symbols, parts of speech, word bigrams, etc.) for discriminating between authors. To achieve this, we can consider different independent feature-scoring selection functions (information gain, gain ratio, pointwise mutual information, odds ratio, chi-square, bi-normal separation, GSS, Darmstadt Indexing Approach (DIA), and correlation coefficient). Other term selection strategies have also been suggested in specific authorship attribution studies. To compare these two families of selection procedures, we extracted articles belonging to two categories (sports and politics) from two newspapers. To broaden the basis of our evaluations, we chose one newspaper written in English (‘Glasgow Herald’) and a second one in Italian (‘La Stampa’). The resulting collections contain from 987 to 2,036 articles written by four to ten columnists. Using the Kullback–Leibler divergence, the chi-square measure, and the Delta rule as attribution schemes, this study found that some simple selection strategies (based on occurrence frequency or document frequency) may produce similar, and sometimes better, results compared with more complex ones.
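
To make the contrast between the two families of selection procedures concrete, the following minimal Python sketch places a simple strategy (ranking terms by document frequency) next to a feature-scoring function (chi-square computed from a 2x2 term/author contingency table). It is an illustration only, not the implementation evaluated in the study: the function names `top_terms_by_df` and `chi_square`, the toy documents, and the author labels are all assumptions introduced here.

```python
from collections import Counter

def top_terms_by_df(docs, k):
    """Return the k terms occurring in the most documents (ties broken alphabetically)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def chi_square(term, docs, authors, target):
    """Chi-square score of `term` for author `target` (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, author in zip(docs, authors):
        present = term in tokens
        if author == target:
            if present:
                a += 1  # target author, term present
            else:
                c += 1  # target author, term absent
        else:
            if present:
                b += 1  # other authors, term present
            else:
                d += 1  # other authors, term absent
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A term present in (or absent from) every document carries no signal.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical toy corpus: tokenized articles and their (made-up) authors.
docs = [
    ["the", "goal", "was", "late"],
    ["the", "vote", "was", "close"],
    ["the", "goal", "stood"],
    ["a", "vote", "on", "the", "budget"],
]
authors = ["A", "B", "A", "B"]

print(top_terms_by_df(docs, 2))
print(chi_square("goal", docs, authors, "A"))
print(chi_square("the", docs, authors, "A"))
```

In this toy corpus, document frequency favors the very common word "the", whereas chi-square gives it a score of zero because it appears in every article and so separates no author from another. Which behavior serves attribution better is the empirical question the study addresses.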

Full Text