Abstract
Applications of authorship attribution 'in the wild' (Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advanced Access published January 12, 2010:10.1007/ s10579-009-9111-2), for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in super- vised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per can- didate author (i.e. the size of the training data). We also investigate the robust- ness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning for discriminating be- tween the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and size of training data decreases, although the machine learning approach continues performing significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.