Abstract

A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf’s law). Here we address the complementary question of whether the rhythm of a text, characterized by the arrangement of its rare words, can be quantified mathematically in a similarly basic way. To this end, we consider representative classic single-authored texts from England/Ireland, France, Germany, China, and Japan. In each text, we classify each word by its rank. We focus on the rare words with ranks above some threshold Q and study the lengths of the (return) intervals between them. We find that for all texts considered, the probability S_Q(r) that the length of an interval exceeds r follows, to high accuracy, a Weibull function, S_Q(r) = exp(−b(β) r^β), with β around 0.7. The return intervals themselves are arranged in a long-range correlated, self-similar fashion: their autocorrelation function C_Q(s) follows a power law, C_Q(s) ∼ s^(−γ), with an exponent γ between 0.14 and 0.48. We show that these features lead to a pronounced clustering of the rare words in the text.
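
For readers who want to experiment with this kind of analysis, the following Python sketch ranks the words of a tokenized text, collects the return intervals between words with rank above a threshold Q, and estimates the empirical survival function S_Q(r) together with the stretched-exponential (Weibull) exponent β. The tokenization, the file name text.txt, the threshold value, and the simple log-log fit are illustrative assumptions, not the authors' code or parameters.

```python
# Sketch: return intervals of rare words and their empirical survival function.
# Assumptions (not from the paper): plain-text input, lowercase word tokenization,
# and a quick least-squares fit of the stretched-exponential form.
import re
from collections import Counter

import numpy as np


def return_intervals(tokens, Q):
    """Intervals between successive occurrences of words with rank > Q."""
    freq = Counter(tokens)
    # Rank 1 = most frequent word (Zipf ordering).
    rank = {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}
    positions = [i for i, w in enumerate(tokens) if rank[w] > Q]
    return np.diff(positions)


def survival(intervals):
    """Empirical S_Q(r) = Prob(interval length > r)."""
    r = np.sort(intervals)
    s = 1.0 - np.arange(1, len(r) + 1) / len(r)
    return r, s


def fit_weibull_exponent(r, s):
    """Estimate beta and b from log(-log S_Q(r)) = beta * log r + log b."""
    mask = (s > 0) & (s < 1) & (r > 0)
    x, y = np.log(r[mask]), np.log(-np.log(s[mask]))
    beta, log_b = np.polyfit(x, y, 1)
    return beta, np.exp(log_b)


if __name__ == "__main__":
    text = open("text.txt", encoding="utf-8").read()   # hypothetical input file
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    iv = return_intervals(tokens, Q=1000)              # threshold Q is illustrative
    r, s = survival(iv)
    beta, b = fit_weibull_exponent(r, s)
    print(f"stretched-exponential fit: beta ≈ {beta:.2f}, b ≈ {b:.3g}")
```

The quick log-log regression only shows the structure of the computation; a maximum-likelihood fit (e.g. scipy.stats.weibull_min.fit) would be a more robust way to estimate β in practice.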

Highlights

  • Can literature be characterized by mathematical laws? According to Zipf [1], the frequency of a word as a function of its rank follows approximately a power law, and the number of different words in a text increases with its length roughly as a power law [2, 3]; a minimal rank-frequency sketch follows this list

  • In this article we considered 10 long literary texts from England/Ireland, France, Germany, China, and Japan and systematically studied the occurrence of rare words in each text
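
As a minimal illustration of the rank-frequency relation referred to in the first highlight, the sketch below builds the rank-frequency table of a tokenized text and estimates a Zipf exponent by a straight-line fit on log-log scales. The file name and the whitespace/word tokenization are again illustrative assumptions.

```python
# Sketch: rank-frequency (Zipf) relation for a tokenized text.
# Input file and tokenization are assumptions made for illustration only.
import re
from collections import Counter

import numpy as np

tokens = re.findall(r"[^\W\d_]+", open("text.txt", encoding="utf-8").read().lower())
counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(counts) + 1)

# Zipf's law predicts log f ≈ const - alpha * log rank; estimate alpha from the
# slope of a straight-line fit in log-log coordinates.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated Zipf exponent: {-slope:.2f}")
```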

Introduction

According to Zipf [1], the frequency of a word as a function of its rank follows approximately a power law, and the number of different words in a text increases with its length roughly as a power law [2, 3]. The question is whether the rhythm of a text, characterized by the arrangement of lower- and higher-ranked words, can be quantified mathematically in a similarly basic way. In recent decades, when analyzing the rhythm of a text, the text was usually mapped onto a sequence {y_i}, i = 1, . . ., N, of numbers that specify either the lengths of words or sentences, or the ranks or frequencies of the words, or onto various binary sequences that mark the occurrences of specific words. Time-series analysis methods from statistical physics, such as Hurst analysis [4], (multifractal) detrended fluctuation analysis (DFA and MF-DFA) [5, 6], and entropy measures, have been used to search for linear and nonlinear memory in texts [7,8,9,10,11,12,13].
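
As an illustration of this sequence-based approach, the sketch below implements a generic first-order detrended fluctuation analysis (DFA1) that can be applied to any numeric sequence {y_i} derived from a text, such as word lengths or word ranks. This is a standard textbook version of DFA, not the authors' implementation; for a long-range correlated sequence with C(s) ∼ s^(−γ) and 0 < γ < 1, one expects a DFA exponent α = 1 − γ/2.

```python
# Sketch: first-order detrended fluctuation analysis (DFA1) of a numeric
# sequence y_1..y_N derived from a text (e.g. word lengths or word ranks).
# Generic illustration; the demo input is synthetic white noise.
import numpy as np


def dfa1(y, scales):
    """Return the DFA1 fluctuation function F(s) for each window size s."""
    y = np.asarray(y, dtype=float)
    profile = np.cumsum(y - y.mean())            # integrated, mean-removed series
    F = []
    for s in scales:
        n_win = len(profile) // s
        segments = profile[: n_win * s].reshape(n_win, s)
        x = np.arange(s)
        sq_fluct = []
        for seg in segments:
            coeff = np.polyfit(x, seg, 1)        # linear detrending in each window
            sq_fluct.append(np.mean((seg - np.polyval(coeff, x)) ** 2))
        F.append(np.sqrt(np.mean(sq_fluct)))
    return np.array(F)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal(100_000)             # white noise: expect alpha ≈ 0.5
    scales = np.unique(np.logspace(1, 3.5, 20).astype(int))
    F = dfa1(y, scales)
    alpha, _ = np.polyfit(np.log(scales), np.log(F), 1)
    print(f"DFA1 exponent alpha ≈ {alpha:.2f}")
```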

