Abstract

Zipf’s law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf’s law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf’s law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf’s law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.

Highlights

  • Zipf’s law for word frequencies is one of the best known statistical regularities of language [1, 2]

  • The second power-law regime reported in Ref. [26] for the high-rank domain of lemmas appears only at the smallest frequencies in two Finnish novels, Kevät ja takatalvi and Vanhempieni romaani, with exponents γ = 1.715 and 1.77±0.008, respectively

  • The tendency to satisfy this inequality is supported by the slight increase of the exponent α when moving from words to lemmas that has been reported in previous research [26, 27] and that we have reviewed in the Introduction


Summary

Introduction

Zipf’s law for word frequencies is one of the best known statistical regularities of language [1, 2]. In its most popular formulation, the law states that the frequency n of the r-th most frequent word of a text follows

$$n(r) \propto \frac{1}{r^{\alpha}}, \qquad (1)$$

where α is a constant and ∝ is the symbol of proportionality. Eq (1) is not the only possible approach for modeling word frequencies in texts. One could look at the number of different words with a given frequency in a text. The probability f(n) that a word has frequency n is given by

$$f(n) \propto \frac{1}{n^{\gamma}}. \qquad (2)$$
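To make the two representations concrete, the following minimal Python sketch computes both empirical quantities from a tokenized text: the rank-frequency list n(r) of Eq (1) and the empirical distribution of frequencies f(n) of Eq (2). The function names and the toy sentence are illustrative assumptions, not part of the original study, and no power-law fitting is attempted here.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return the frequencies n(r) in decreasing order.

    Index 0 corresponds to rank r = 1 (the most frequent type).
    """
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def frequency_distribution(tokens):
    """Return the empirical f(n): the fraction of distinct types
    that occur exactly n times in the text."""
    counts = Counter(tokens)
    # Number of distinct types for each observed frequency n.
    spectrum = Counter(counts.values())
    total_types = sum(spectrum.values())
    return {n: k / total_types for n, k in spectrum.items()}


# Toy example (hypothetical data, far too short to test Zipf's law).
tokens = "the cat and the dog and the bird".split()
n_of_r = rank_frequency(tokens)
f_of_n = frequency_distribution(tokens)
```

The same functions can be applied to a list of word forms or to a list of lemmas. Comparing the two outputs for one text is the kind of word-lemma comparison the abstract describes, although estimating the exponents and the low-frequency cut-off requires a proper fitting procedure that this sketch does not include.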

