Abstract

Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. The current support for Zipf’s law in texts can therefore be summarized as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).

Highlights

  • Zipf’s law constitutes a striking quantitative regularity in the usage of language [1,2,3,4]

  • The same law has been claimed in other codes of communication, as in music [8] or for the timbres of sounds [9]

  • In order to fit these three distributions to the different texts, and test the goodness of such fits, we use maximum likelihood estimation [46] followed by the Kolmogorov-Smirnov (KS) test [47]
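The fit-then-test procedure named in the last highlight can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it assumes a pure discrete power law for the probability mass function, p(n) ∝ n^(-γ) for n ≥ 1, which is not necessarily any of the three forms studied in the paper. The symbol γ, the function names, and the toy data are all hypothetical. It estimates γ by maximum likelihood and then computes the Kolmogorov-Smirnov distance between the empirical and fitted cumulative distributions. Turning that distance into a p-value would additionally require a reference distribution for the distance, which this sketch does not compute.

```python
import math
from collections import Counter


def zeta(g, terms=1000):
    """Riemann zeta for g > 1: direct sum plus an Euler-Maclaurin tail estimate."""
    head = sum(k ** -g for k in range(1, terms))
    tail = terms ** (1 - g) / (g - 1) + 0.5 * terms ** -g
    return head + tail


def fit_exponent(data, lo=1.05, hi=6.0, tol=1e-6):
    """Maximum likelihood estimate of g for p(n) = n**-g / zeta(g), n >= 1."""
    n = len(data)
    sum_log = sum(math.log(x) for x in data)

    def nll(g):
        # Negative log-likelihood; convex in g, so a bracketing search suffices.
        return g * sum_log + n * math.log(zeta(g))

    phi = (math.sqrt(5) - 1) / 2  # golden-section ratio
    a, b = lo, hi
    while b - a > tol:
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if nll(c) < nll(d):
            b = d
        else:
            a = c
    return (a + b) / 2


def ks_distance(data, g):
    """Largest gap between empirical and model CDFs over 1..max(data)."""
    n = len(data)
    counts = Counter(data)
    z = zeta(g)
    emp = model = worst = 0.0
    for k in range(1, max(data) + 1):
        emp += counts.get(k, 0) / n
        model += k ** -g / z
        worst = max(worst, abs(emp - model))
    return worst


# Toy word-frequency sample (hypothetical): 608 words occur once, 152 twice, ...
toy_counts = {1: 608, 2: 152, 3: 68, 4: 38, 5: 24, 6: 17, 7: 12, 8: 9, 9: 8, 10: 6}
data = [value for value, times in toy_counts.items() for _ in range(times)]

gamma_hat = fit_exponent(data)
ks = ks_distance(data, gamma_hat)
print(f"estimated exponent: {gamma_hat:.3f}, KS distance: {ks:.3f}")
```

Golden-section search is used here only to keep the sketch free of external dependencies; any one-dimensional optimizer would serve the same purpose.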

Summary

Introduction

Zipf’s law constitutes a striking quantitative regularity in the usage of language [1,2,3,4]. A slightly more general formulation includes a parameter in the form of an exponent α; the rank-frequency relation then takes the form of a power law,

$$n \propto \frac{1}{r^{\alpha}}, \tag{1}$$

where n is the frequency of a word and r its rank, with the value of α close to one. This pattern, Eq (1), has been found across different languages, literary styles, time periods, and levels of morphological abstraction [2,5,6,7]. The same law has been claimed in other codes of communication, as in music [8] or for the timbres of sounds [9].

PLOS ONE | DOI:10.1371/journal.pone.0147073 | January 22, 2016
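The rank-frequency relation in the introduction can be made concrete with a short sketch. The code below is an illustration only: the function names `rank_frequency` and `zipf_prediction`, the toy sentence, and the choice to normalise the prediction so that rank 1 matches the largest observed frequency are all assumptions made for this example, not part of the paper. It counts word frequencies, assigns rank 1 to the most frequent word, and evaluates the frequency that a power law with exponent α would predict at a given rank.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency, word) triples, rank 1 being the most frequent word.

    Ties in frequency are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, freq, word) for rank, (word, freq) in enumerate(ordered, start=1)]


def zipf_prediction(rank, n_max, alpha=1.0):
    """Frequency predicted by n ∝ 1 / r**alpha, scaled so that rank 1 has n_max."""
    return n_max / rank ** alpha


tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)

n_max = table[0][1]
for rank, freq, word in table:
    predicted = zipf_prediction(rank, n_max, alpha=1.0)
    print(f"{rank:>2} {word:<5} observed={freq} predicted={predicted:.2f}")
```

A sentence this short says nothing about whether the law holds; the sketch only shows how ranks and frequencies are paired before any fitting is attempted.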
