Abstract

We suggest two approaches to the statistical analysis of texts, both based on the study of the occurrence of numerals in literary texts. The first approach is related to Benford’s Law and analyzes the frequency distribution of the leading digits of the numerals contained in a text. In coherent literary texts, the share of the leading digit 1 is even larger than prescribed by Benford’s Law and can reach 50 percent. The frequencies of occurrence of the digit 1 and, to a lesser extent, of the digits 2 and 3 are usually a characteristic feature of the author’s style, manifested in all (sufficiently long) literary texts by a given author. This approach is convenient for testing whether a group of texts has common authorship: common authorship is dubious if the frequency distributions are sufficiently different. The second approach is an extension of the first and requires the study of the frequency distribution of the numerals themselves (not their leading digits). It yields non-trivial information about the author and about the stylistic and genre peculiarities of the texts, and is suited to advanced stylometric analysis. The proposed approaches are illustrated by examples of computer analysis of literary texts in English and Russian.
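The first approach can be illustrated with a minimal Python sketch. It assumes the numerals have already been extracted from a text and converted to positive integers; how the extraction is done is not described here. The function names, the sample data, and the use of total variation distance to compare two distributions are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's Law: P(d) = log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(values):
    """Share of each leading digit 1..9 among the positive integers in `values`."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive numerals to analyze")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def total_variation(p, q):
    """Total variation distance between two leading-digit distributions.

    Assumption: the source does not say how 'sufficiently different'
    is measured; this is one simple choice among many.
    """
    return 0.5 * sum(abs(p[d] - q[d]) for d in range(1, 10))


if __name__ == "__main__":
    # Made-up numerals standing in for those found in a text.
    numerals = [1, 1, 2, 10, 12, 3, 100, 7, 1, 20]
    observed = leading_digit_shares(numerals)
    expected = benford_expected()
    print("share of leading 1:", observed[1])
    print("Benford share of 1:", round(expected[1], 3))
    print("distance from Benford:", round(total_variation(observed, expected), 3))
```

Comparing the shares of two texts with `total_variation(shares_a, shares_b)` gives a single number whose size could then be judged against the spread seen among texts known to share an author; any threshold would have to be chosen empirically.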

Highlights

  • Benford’s Law [1]—a strange manifestation of the law of large numbers—is sometimes rightfully called curious, surprising and mysterious [2,3]

  • The fact that the digit 7 occurs too often is explained by biblical number symbolism, in which this number occupies a dominant position. This was all that the scientific literature had to say about the occurrence of numerals in texts in connection with Benford’s Law and stylometry problems in 2014, when our research in this area began [18]

  • We started our research in the general direction of “putting yet another object to the test for Benfordness”, but the results obtained changed our intentions and served as a starting point for several years of stylometric research. Texts written by Caesar are characterized by a similar and abnormally low share of 1 as the first significant digit: it does not even reach Benford’s 30 percent


Summary

Introduction

Benford’s Law [1], a strange manifestation of the law of large numbers (understood as the combined action of a large number of random factors leading to a result that is almost independent of chance), is sometimes rightfully called curious, surprising and mysterious [2,3]. There is still no complete understanding of why some data sets obey this law while others do not. This incomplete understanding does not prevent the emergence of more and more proposals for the practical use of Benford’s Law across a wide range of sciences, from geodesy [5] and geology [6] through genomics [7] and ecology [8,9] to scientometrics [10]. In the USA, evidence based on Benford’s Law [16] has been admitted in criminal cases of financial fraud at the federal, state, and local levels. The work is heuristic by nature and does not aim at a theoretical justification of the results (if that is even possible), which does not detract from the possibility of applying the proposed methodology to stylometry tasks.
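For reference, the standard statement of Benford’s Law for the first significant digit, which is where the benchmark of roughly 30 percent for the digit 1 comes from, can be written as follows. The notation P(d) is introduced here for illustration and is not taken from the source.

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \dots, 9,
\qquad
P(1) \approx 0.301,\quad P(2) \approx 0.176,\quad P(3) \approx 0.125 .
% The nine probabilities sum to 1 because the sum telescopes:
\sum_{d=1}^{9} \log_{10}\frac{d+1}{d} = \log_{10} 10 = 1 .
```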

Benford’s Law and Texts
Distribution When of the First
Coherent Literary
Twain’s
First Significant Digits and Texts Authorship Attribution
Statistical Characteristics of Translated Texts
Who Wrote “The Twelve Chairs”?
Discussion
