Abstract

Applications of authorship attribution 'in the wild' (Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advance Access published January 12, 2010, doi:10.1007/s10579-009-9111-2), for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in supervised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per candidate author (i.e. the size of the training data). We also investigate the robustness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning to discriminate between the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and the size of the training data decreases, although the machine learning approach continues to perform significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.
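To make the setup concrete, here is a minimal sketch of the text categorization approach described above: each text fragment is represented as a character n-gram vector and attributed to an author by a memory-based (k-nearest-neighbour) learner. The toy corpus, the use of scikit-learn, the n-gram range, and k = 1 are illustrative assumptions, not the authors' exact experimental configuration.

```python
# Minimal sketch: authorship attribution as text categorization with
# character n-gram features and a memory-based (k-NN) classifier.
# Corpus, n-gram range, and k are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy training data: short e-mail-length fragments labelled by author.
texts = [
    "Thanks for the update; I will review the draft tonight.",
    "I will review it tonight and send comments tomorrow.",
    "hey, got ur msg - will call u later ok",
    "gonna be late, call u when i'm done",
]
authors = ["A", "A", "B", "B"]

# Character n-grams (here 2-4) are among the feature types the abstract
# reports as robust to changes in author set size and data size.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    # Memory-based learning: store training instances, classify a new
    # fragment by its nearest neighbour(s) in feature space.
    KNeighborsClassifier(n_neighbors=1),
)
pipeline.fit(texts, authors)

print(pipeline.predict(["thx, will call u in a bit"]))  # expected: ['B']
```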
