Stylometry and Numerals Usage: Benford’s Law and Beyond

Andrei V Zenkov

doi:10.3390/stats4040060

Abstract

We suggest two approaches to the statistical analysis of texts, both based on the study of the occurrence of numerals in literary texts. The first approach is related to Benford’s Law and analyzes the frequency distribution of the leading digits of numerals contained in the text. In coherent literary texts, the share of the leading digit 1 is even larger than prescribed by Benford’s Law and can reach 50 percent. The frequencies of occurrence of the digit 1 and, to a lesser extent, of the digits 2 and 3 are usually a characteristic feature of the author’s style, manifested in all (sufficiently long) literary texts of a given author. This approach is convenient for testing whether a group of texts has common authorship: common authorship is dubious if the frequency distributions are sufficiently different. The second approach extends the first and studies the frequency distribution of the numerals themselves (not their leading digits). It yields non-trivial information about the author and about the stylistic and genre peculiarities of the texts, and is suited to advanced stylometric analysis. The proposed approaches are illustrated by examples of computer analysis of literary texts in English and Russian.
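
As an illustration of the two approaches, the following minimal Python sketch counts the leading digits of the numerals in a text and sets them beside the shares predicted by Benford’s Law, P(d) = log10(1 + 1/d). It also tallies the numerals themselves, as the second approach requires. The function names and the extraction rule are assumptions made for this sketch, not the author’s procedure: only numerals written with digits are counted, so spelled-out numerals such as “seven” are ignored, and a grouped number such as 1,000 is split at the comma.

```python
import math
import re
from collections import Counter


def benford_expected():
    """Shares predicted by Benford's Law: P(d) = log10(1 + 1/d), d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def extract_numerals(text):
    """Return the numerals in `text` as integers.

    Assumption of this sketch: a numeral is a run of ASCII digits.
    Spelled-out numerals and digit grouping (1,000) are not handled.
    """
    return [int(match) for match in re.findall(r"[0-9]+", text)]


def leading_digit_shares(numerals):
    """Share of each leading digit 1..9 among the positive numerals."""
    digits = [int(str(n)[0]) for n in numerals if n > 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}


def numeral_frequencies(numerals):
    """Frequency of each numeral itself (the second approach)."""
    return Counter(numerals)


if __name__ == "__main__":
    # Hypothetical sample sentence, used only to exercise the functions.
    sample = "In 1884 he had 12 cats, 1 dog and 3 birds; by 1885 only 2 cats."
    numerals = extract_numerals(sample)
    observed = leading_digit_shares(numerals)
    expected = benford_expected()
    for d in range(1, 10):
        print(f"digit {d}: observed {observed[d]:.3f}, Benford {expected[d]:.3f}")
    print(numeral_frequencies(numerals).most_common(3))
```

For the digit 1, `benford_expected()[1]` equals log10(2), about 0.301; this is the 30 percent benchmark against which the share of leading 1s in a text is judged. A sample as short as the one above says nothing about style: the abstract stresses that the texts must be sufficiently long.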

Highlights

Benford’s Law [1], a strange manifestation of the law of large numbers, is sometimes rightfully called curious, surprising and mysterious [2,3].
The fact that the digit 7 occurs too often is explained by biblical number symbolism, in which this number occupies a dominant position. As of 2014, when our research in this area began [18], this was all that the scientific literature contained on the occurrence of numerals in texts in connection with Benford’s Law and stylometry problems.
We started our research in the general direction of “putting yet another object to the test for Benfordness”, but the results obtained corrected our intentions and served as a starting point for several years of stylometric research. Texts written by Caesar are characterized by a similar and abnormally low share of 1 as the first significant digit: it does not even reach Benford’s 30 percent.

Summary

Introduction

Benford’s Law [1] is a strange manifestation of the law of large numbers (understood as the combined action of a large number of random factors leading to a result that is almost independent of chance) and is sometimes rightfully called curious, surprising and mysterious [2,3]. There is still no complete understanding of why some data sets obey this law while others do not. This incomplete understanding does not prevent the emergence of more and more proposals for the practical use of Benford’s Law across a wide range of sciences, from geodesy [5] and geology [6] through genomics [7] and ecology [8,9] to scientometrics [10]. In the USA, evidence based on Benford’s Law [16] has been admitted in criminal cases of financial fraud at the federal, state, and local levels. The work is heuristic by nature and does not aim at a theoretical justification of the results (if that is even possible), which does not detract from the possibility of applying the proposed methodology to stylometry tasks.

Benford’s Law and Texts

Distribution When of the First

Coherent Literary

Twain’s

First Significant Digits and Texts Authorship Attribution

Statistical Characteristics of Translated Texts

Who Wrote “The Twelve Chairs”?

Discussion

Journal: Stats
Publication Date: Dec 14, 2021
Citations: 2
License type: CC BY 4.0
