Abstract

<p>We propose a statistical analysis of the heterogeneity of literary style in a set of texts that simultaneously uses different stylometric characteristics, such as word length and the frequency of function words. The data set consists of several tables with the same number of rows, with the i-th row of all tables corresponding to the i-th text. The proposed analysis clusters the rows of all these tables simultaneously into groups with homogeneous style, based on a finite mixture of sets of multinomial models, one set for each table. Unlike the usual heuristic cluster analysis approaches, our method naturally incorporates the text size, the discrete nature of the data, and the dependence between categories in the analysis. The model is checked and chosen with the help of posterior predictive checks, together with closed-form expressions for the posterior probabilities that each of the models considered is appropriate. The method is illustrated through an analysis of the heterogeneity in Shakespeare’s plays, and by revisiting the authorship-attribution problem of Tirant lo Blanc.</p>

Highlights

  • The statistical analysis of literary style has often been used to characterize the style of texts and authors, and to help settle authorship-attribution problems in both academic and legal contexts

  • Other characteristics widely used for this purpose have been the proportion of nouns, articles, adjectives or adverbs; the frequency of use of function words, which are independent of the context, or of characters; and the richness and diversity of the vocabulary used by the author

  • The data consisted of two contingency tables of ordered rows, with the i-th row in both tables corresponding to the i-th chapter of the book, and the cluster analysis of the rows of each one of these two tables was carried out separately based on a finite mixture of multinomial models

Summary

Introduction

The statistical analysis of literary style has often been used to characterize the style of texts and authors, and to help settle authorship-attribution problems in both academic and legal contexts. Mendenhall (1887, 1901) used word length and sentence length to characterize literary style. Model-based approaches simultaneously group objects and estimate the component parameters, which avoids the biases that appear when the two steps are carried out separately. These methods have the advantage of providing a measure of the uncertainty in allocating individual observations to clusters, and of casting the choice of the number of clusters and of component distributions as a statistical model selection problem. The data consisted of two contingency tables of ordered rows, with the i-th row in both tables corresponding to the i-th chapter of the book, and the cluster analysis of the rows of each one of these two tables was carried out separately based on a finite mixture of multinomial models. By using these models to implement a cluster analysis, the texts are classified based on the whole vector of word-length or function-word counts instead of using only individual counts. Without making a list of candidate authors and of training texts explicit, there is no legitimate statistical way of going beyond proposing tentative explanations for the heterogeneities detected in the corpus.
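To make the idea of clustering rows of several count tables at once more concrete, the following is a minimal sketch, not the authors' procedure. It fits a finite mixture in which every cluster has one multinomial per table, using a plain EM algorithm as a stand-in for the Bayesian analysis described in the paper. The function name `fit_multinomial_mixture`, the probability vectors, and the synthetic counts are all hypothetical and serve only as illustration.

```python
import numpy as np

def fit_multinomial_mixture(tables, n_clusters, n_iter=200, seed=0, eps=1e-10):
    """EM for a mixture in which each cluster has one multinomial per table.

    tables: list of (n_texts, n_categories_t) count arrays sharing the row order.
    Returns mixing weights, per-table category probabilities, and the
    (n_texts, n_clusters) matrix of cluster membership probabilities.
    """
    rng = np.random.default_rng(seed)
    tables = [np.asarray(t, dtype=float) for t in tables]
    n_texts = tables[0].shape[0]
    # Random soft initialisation of the cluster memberships.
    resp = rng.dirichlet(np.ones(n_clusters), size=n_texts)
    for _ in range(n_iter):
        # M-step: mixing weights and category probabilities from weighted counts.
        weights = resp.mean(axis=0)
        probs = []
        for t in tables:
            weighted = resp.T @ t + eps  # shape (n_clusters, n_categories_t)
            probs.append(weighted / weighted.sum(axis=1, keepdims=True))
        # E-step: log-likelihood of each row under each cluster, summed over
        # tables. The multinomial coefficient is constant across clusters and
        # is therefore dropped.
        log_post = np.log(weights + eps)[None, :]
        for t, p in zip(tables, probs):
            log_post = log_post + t @ np.log(p).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, probs, resp

# Hypothetical example: 20 texts, two styles, two tables (word length, function words).
rng = np.random.default_rng(1)
p_len = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
p_fun = np.array([[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]])
labels = np.repeat([0, 1], 10)
word_len = np.vstack([rng.multinomial(500, p_len[k]) for k in labels])
fun_words = np.vstack([rng.multinomial(300, p_fun[k]) for k in labels])

weights, probs, resp = fit_multinomial_mixture([word_len, fun_words], n_clusters=2)
hard_labels = resp.argmax(axis=1)
```

Because the raw counts enter the log-likelihood directly, longer texts contribute more evidence than shorter ones, and each text is classified from its whole count vector rather than from individual categories. The rows of `resp` give a rough measure of allocation uncertainty. Choosing the number of clusters would still require the model checking and model selection steps the paper describes.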

Description of the Data
Chapter 1
Description of the Multinomial Cluster Model
The Choice of the Number of Clusters
Choice of s Through Model-Checking
Choice of s Through Model Selection
Case Study 1
Case Study 2
Final Comments