Abstract

The dependence on text length of the statistical properties of word occurrences has long been considered a severe limitation on the usefulness of quantitative linguistics. We propose a simple scaling form for the distribution of absolute word frequencies that brings to light the robustness of this distribution as text grows. In this way, the shape of the distribution is always the same, and it is only a scale parameter that increases (linearly) with text length. By analyzing very long novels we show that this behavior holds both for raw, unlemmatized texts and for lemmatized texts. In the latter case, the distribution of frequencies is well approximated by a double power law, maintaining the Zipf exponent value γ ≃ 2 for large frequencies but yielding a smaller exponent in the low-frequency regime. The growth of the distribution with text length allows us to estimate the size of the vocabulary at each step and to propose a generic alternative to Heaps' law, which turns out to be intimately connected to the distribution of frequencies, thanks to its scaling behavior.

Highlights

  • They propose a size-dependent word-frequency distribution based on three main assumptions: (i) The vocabulary scales with text length as V_L ∝ L^{α(L)}, where the exponent α(L) itself depends on the text length

  • A scaling function g(x) provides a constant shape for the distribution of frequencies of each text, D_L(n), no matter its length L, which only enters into the distribution as a scale parameter and determines the size of the vocabulary V_L

  • The apparent size-dependent exponent found previously seems to be an artifact of the slight convexity of g(x) in a log–log plot, which is more clearly observed for very small values of x, accessible only for the largest text lengths
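
The constant-shape claim in the highlights can be probed numerically. The following is a minimal sketch, not the authors' code: it assumes the text is already tokenized into a list of strings, and the function names `frequency_distribution` and `rescaled_points` are hypothetical. It computes D_L(n) for a text of length L and rescales it to the points (n/L, L·V_L·D_L(n)), which should fall on a single curve g(x) for different prefixes of the same book if the scaling form holds.

```python
from collections import Counter


def frequency_distribution(tokens):
    """Return (L, V_L, D_L) for a list of tokens.

    L   : number of tokens (text length)
    V_L : number of distinct types (vocabulary size)
    D_L : dict mapping absolute frequency n -> fraction of types with frequency n
    """
    L = len(tokens)
    type_counts = Counter(tokens)                 # type -> absolute frequency n
    V = len(type_counts)
    freq_of_freq = Counter(type_counts.values())  # n -> number of types with frequency n
    D = {n: k / V for n, k in freq_of_freq.items()}
    return L, V, D


def rescaled_points(tokens):
    """Return sorted points (x, y) = (n / L, L * V_L * D_L(n)).

    Under the scaling hypothesis these points trace the same function g(x)
    whatever the length of the prefix passed in.
    """
    L, V, D = frequency_distribution(tokens)
    return sorted((n / L, L * V * d) for n, d in D.items())


if __name__ == "__main__":
    toy = "a b a c a b d".split()
    print(frequency_distribution(toy))
    print(rescaled_points(toy))
```

To compare text lengths, one would call `rescaled_points(tokens[:L])` for several values of L and overlay the results on log-log axes. For real texts the raw per-n values are noisy at large n, so some form of binning would likely be needed before judging the collapse.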


Summary

The scaling form of the word-frequency distribution

Let us come back to the rank-frequency relation, in which the absolute frequency n of each type is a function of its rank r. This turns out to be a scaling law, r = G(n/L), with G(x) a scaling function. It means that if in the first 10 000 tokens of a book there are five types with relative frequency larger than or equal to 2%, that is, G(0.02) = 5, this will still be true for the first 20 000 tokens, for the first 100 000, and for the whole book. If one does not trust the continuous approximation, one can write D_L(n) = S_L(n) − S_L(n + 1) and perform a Taylor expansion, for which the result is the same, but with g(x) ≃ −G′(x). In this way, we obtain simple forms for S_L(n) and D_L(n), which are analogous to standard scaling laws, except for the fact that we have not specified how V_L changes with L.
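
The discrete route mentioned above can be written out as a short derivation. This is a sketch under two stated assumptions: that the rank-frequency scaling law reads r = G(n/L), which is what the example G(0.02) = 5 implies, and that S_L(n) denotes the fraction of types with absolute frequency greater than or equal to n, so that the rank of a type with frequency n is r = V_L S_L(n).

```latex
% Assumptions: r = G(n/L), and S_L(n) = fraction of types with frequency >= n,
% so that r = V_L S_L(n).
\begin{align}
  S_L(n) &= \frac{r}{V_L} = \frac{G(n/L)}{V_L}, \\
  D_L(n) &= S_L(n) - S_L(n+1)
          = \frac{G(n/L) - G\big((n+1)/L\big)}{V_L} \\
         &\simeq -\frac{G'(n/L)}{L\,V_L}
          = \frac{g(n/L)}{L\,V_L},
          \qquad g(x) \simeq -G'(x).
\end{align}
% The approximation is a first-order Taylor expansion of G around x = n/L
% with step 1/L, so it improves as L grows.
% Setting n = 1 (every type occurs at least once) gives V_L = G(1/L).
```

The last remark follows from the fact that all V_L types have frequency at least 1. It fixes V_L through G but leaves open how V_L varies with L until the form of G is specified.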

