Abstract

In this article we revisit a divisive issue concerning the corpus of one of the most famous nineteenth-century philosophers: John Stuart Mill. He was the author of two iconic texts in the history of political philosophy: On Liberty and The Subjection of Women. However, Mill attributed the first to collaboration with Harriet Taylor Mill, his wife, and characterized the second as a work of three minds: his own, his wife’s, and that of her daughter, Helen Taylor. Experts disagree on this issue. Most think Mill was too generous in sharing authorship credit. We use a training set consisting of manuscripts by the three above-mentioned authors to train models for a four-class problem (three authors and joint productions). For every manuscript in the training set we extract a set of features that are widely used in text analytics and classification. Then, we apply pre-processing techniques to normalize the data and to reduce the number of features. Finally, we train three classifiers, namely k-nearest neighbours (k-NNs) with k = 1 and k = 2, support vector machines (SVMs), and decision trees (DTs), to attribute the texts of “disputed” authorship to one of the four classes. We repeat the experiments with a different feature set each time, in order to identify the combination of features that yields the best results on the test set. The best results are achieved with the SVMs, taking as input the bigram features and their principal components. The mean detection rate for all four classes is 100%. Similar results are achieved with the models built with the k-NNs (k = 1) and the DTs. The only classifier that consistently returns significantly lower results is the k-NN with k = 2. All of the instances in the test set are attributed to John Stuart Mill.
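The pipeline described above (feature extraction, normalization, classification) can be illustrated with a small sketch. It is not the authors' implementation: it assumes character-level bigrams (the abstract does not say whether the bigrams are of characters or words), z-score normalization, Euclidean distance, and a 1-nearest-neighbour rule, and it omits the principal-component, SVM, and decision-tree steps. All function names and the toy texts are hypothetical.

```python
from collections import Counter
import math


def bigram_freqs(text):
    """Relative frequencies of character bigrams (assumption: character-level)."""
    s = " ".join(text.lower().split())  # lowercase and collapse whitespace
    counts = Counter(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(counts.values()) or 1
    return {bg: c / total for bg, c in counts.items()}


def fit_scaler(profiles):
    """Per-feature mean and standard deviation over the training profiles."""
    vocab = sorted({bg for p in profiles for bg in p})
    stats = {}
    for bg in vocab:
        vals = [p.get(bg, 0.0) for p in profiles]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[bg] = (mean, std or 1.0)  # constant features keep a unit scale
    return vocab, stats


def vectorize(profile, vocab, stats):
    """Z-score a bigram profile against the training statistics."""
    return [(profile.get(bg, 0.0) - stats[bg][0]) / stats[bg][1] for bg in vocab]


def predict_1nn(train_vecs, train_labels, vec):
    """Return the label of the closest training vector (k-NN with k = 1)."""
    dists = [math.dist(v, vec) for v in train_vecs]
    return train_labels[dists.index(min(dists))]


def attribute(train_texts, train_labels, text):
    """Attribute `text` to the class of its nearest training text."""
    profiles = [bigram_freqs(t) for t in train_texts]
    vocab, stats = fit_scaler(profiles)
    train_vecs = [vectorize(p, vocab, stats) for p in profiles]
    query = vectorize(bigram_freqs(text), vocab, stats)
    return predict_1nn(train_vecs, train_labels, query)


# Toy data only; the class names are placeholders, not the study's corpus.
toy_texts = ["the cat sat on the mat", "zebra quiz jazz fuzz"]
toy_labels = ["class_A", "class_B"]
predicted = attribute(toy_texts, toy_labels, "a cat on a mat")
```

With k = 1 every training text is its own nearest neighbour, so the training detection rate is trivially perfect; only held-out texts say anything about how well the features separate the classes.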

Highlights

  • The need for systems that can automatically attribute a text to its author has become urgent of late, owing to the dramatic increase in texts whose content poses a public threat and whose author is unknown, for example texts on social media that may incite people to violence against others or themselves

  • For every fold we compute the detection rate (DR) on both the training and test sets, and we report the results as an average DR across folds

  • In this work we focus on testing several feature sets to identify the set that best distinguishes the four classes
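The per-fold evaluation mentioned in the highlights can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: the detection rate is taken here to be the fraction of each class's instances that are classified correctly, averaged over classes, and the folds are interleaved without shuffling. The names `detection_rate`, `k_fold_indices`, `cross_validate`, and the `fit_predict` callback are hypothetical.

```python
def detection_rate(y_true, y_pred):
    """Mean per-class detection rate (assumed definition: per-class recall)."""
    rates = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        rates.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(rates) / len(rates)


def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds (no shuffling)."""
    return [list(range(f, n, k)) for f in range(k)]


def cross_validate(samples, labels, k, fit_predict):
    """Average training and test DR over k folds.

    fit_predict(train_X, train_y, X) must return predicted labels for X.
    """
    train_drs, test_drs = [], []
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        tr_X = [samples[i] for i in train_idx]
        tr_y = [labels[i] for i in train_idx]
        te_X = [samples[i] for i in test_idx]
        te_y = [labels[i] for i in test_idx]
        train_drs.append(detection_rate(tr_y, fit_predict(tr_X, tr_y, tr_X)))
        test_drs.append(detection_rate(te_y, fit_predict(tr_X, tr_y, te_X)))
    return sum(train_drs) / k, sum(test_drs) / k
```

Reporting the training and test averages side by side makes overfitting visible: a model whose training DR is high but whose test DR is much lower has memorized the training manuscripts rather than learned features that generalize.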


Summary

Introduction

The need for systems that can automatically attribute a text to its author has become urgent of late, owing to the dramatic increase in texts whose content poses a public threat and whose author is unknown, for example texts on social media that may incite people to violence against others or themselves. Automated Authorship Attribution (AA) of texts has several applications, including criminal investigations (e.g. establishing the authenticity of suicide notes), identifying the authors of harassing emails, and others [1], [2]. Authors may write anonymously for various reasons: the threat of censorship, prosecution or persecution, the wish to dissociate a text from one particular individual, or even the intent to deceive the reading public. One famous attempt at attributing important eighteenth-century political texts is the work of Mosteller and Wallace on ‘‘The Federalist Papers’’ [3]. John Stuart Mill (1806–1873) was a very famous British philosopher of the nineteenth century. His influence is still visible today in political and social philosophy, the methodology of the social sciences, and economic theory.

