Abstract

The word-frequency distribution provides the fundamental building blocks that generate discourse in natural language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf's law, at least approximately. Following Stephens and Bialek (2010), we interpret the frequency of any word as arising from the interaction potentials between its constituent letters. Indeed, Jaynes' maximum-entropy principle, with constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of the all-to-all pairwise (two-letter) potentials. The so-called improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. We considerably extend Stephens and Bialek's results, applying this formalism to words of up to six letters in length from the English subset of the recently created Standardized Project Gutenberg Corpus. We find that the model is able to reproduce Zipf's law, but with some limitations: the general Zipf power-law regime is obtained, but the probabilities of individual words show considerable scatter. In this way, a pure statistical-physics framework is used to describe the probabilities of words. As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws.
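
To make the construction concrete, here is a minimal, self-contained sketch of the approach (not the authors' code): it fits a maximum-entropy word model whose only constraints are the empirical two-letter marginals, on a toy four-letter alphabet with words of length three. For simplicity it uses generalized iterative scaling as a stand-in for the improved iterative-scaling algorithm mentioned above; GIS applies here because every word of length L activates exactly L(L-1)/2 pairwise indicator features.

```python
# Sketch only: toy alphabet, toy corpus, and generalized iterative scaling
# (the paper itself uses improved iterative scaling on a real English corpus).
import itertools
import math
import random
from collections import Counter

ALPHABET = "abcd"   # assumption: tiny alphabet so all words can be enumerated
L = 3               # assumption: fixed word length (the paper goes up to 6)
PAIRS = list(itertools.combinations(range(L), 2))
C = len(PAIRS)      # number of active pairwise features per word = L*(L-1)/2

# Toy "empirical" distribution over all length-L strings.
random.seed(0)
words = ["".join(t) for t in itertools.product(ALPHABET, repeat=L)]
emp_counts = Counter({w: random.randint(1, 50) for w in words})
N = sum(emp_counts.values())
emp_p = {w: c / N for w, c in emp_counts.items()}

def marginals(p):
    """Two-letter marginals p_ij(a, b) of a word distribution p."""
    m = {pair: Counter() for pair in PAIRS}
    for w, pw in p.items():
        for (i, j) in PAIRS:
            m[(i, j)][(w[i], w[j])] += pw
    return m

emp_m = marginals(emp_p)

# Model: p(w) proportional to exp(-E(w)), E(w) = sum of pairwise potentials
# V_ij(w_i, w_j); the potentials are minus the Lagrange multipliers.
V = {pair: Counter() for pair in PAIRS}

def model_p():
    weights = {w: math.exp(-sum(V[(i, j)][(w[i], w[j])] for (i, j) in PAIRS))
               for w in words}
    Z = sum(weights.values())
    return {w: x / Z for w, x in weights.items()}

# Generalized iterative scaling: lower the potential (raise the multiplier)
# wherever the model underestimates the empirical two-letter marginal.
for _ in range(300):
    mod_m = marginals(model_p())
    for (i, j) in PAIRS:
        for key, target in emp_m[(i, j)].items():
            V[(i, j)][key] -= (1.0 / C) * math.log(
                target / max(mod_m[(i, j)][key], 1e-12))

# Rank-frequency comparison: model probability vs. empirical probability.
p = model_p()
ranked = sorted(p.items(), key=lambda kv: -kv[1])
for rank, (w, pw) in enumerate(ranked[:5], 1):
    print(rank, w, round(pw, 4), round(emp_p[w], 4))
```

Ranking the fitted probabilities and plotting them against rank on log-log axes gives the model's rank-frequency (Zipf) curve, which can then be compared with the empirical one, as done in the paper for the real corpus.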

Highlights

  • As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws

  • Zipf's law is a pattern that emerges in many complex systems composed of individual elements that can be grouped into different classes or types [1]

  • The "size" of a word type is given by the number of its tokens that appear in the text under study, and the linguistic version of Zipf's law refers to the distribution of these token counts


Summary

Introduction

Zipf's law is a pattern that emerges in many complex systems composed of individual elements that can be grouped into different classes or types [1]. The resulting Boltzmann-like factor allows one to identify, in a natural way, the Lagrange multipliers (obtained in the maximization of entropy under the empirical values of the constraints) with the interaction potentials (with a negative sign). We will considerably extend their results by analyzing words of any length from 1 to 6 letters in a much larger English corpus, and will undertake a quantitative statistical analysis of the fulfillment of Zipf's law. In this way, using Shannon and Jaynes' framework, we will obtain a Boltzmann-like factor for the word probabilities that will allow a direct comparison with Zipf's law.

Maximum Entropy and Pairwise Interactions

"Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge," which leads to a special type of statistical inference. The analogy one finds in this way (through the Boltzmann factor) is so neat that it is not possible to escape identifying it with a sort of "energy".
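
To spell out that identification in symbols, here is a schematic version of the constrained maximization; the notation is ours and serves only to illustrate the construction described above.

```latex
% Schematic constrained maximization (our notation, not necessarily the paper's).
% Maximize the Shannon entropy of the word distribution,
%   S[p] = -\sum_w p(w)\,\ln p(w),
% subject to normalization and, for every position pair i<j and letters a,b,
% to the empirical two-letter marginals
%   \sum_w p(w)\,\delta_{w_i,a}\,\delta_{w_j,b} = \tilde p_{ij}(a,b).
% Introducing a Lagrange multiplier \lambda_{ij}(a,b) per constraint yields
\[
  p(w) \;=\; \frac{1}{Z}\,
  \exp\!\Bigg(\sum_{i<j}\lambda_{ij}(w_i,w_j)\Bigg)
  \;=\; \frac{1}{Z}\,
  \exp\!\Bigg(-\sum_{i<j}V_{ij}(w_i,w_j)\Bigg),
  \qquad V_{ij} \equiv -\lambda_{ij},
\]
% i.e., a Boltzmann-like factor whose "energy" is the sum of all-to-all pairwise
% potentials, with the multipliers equal to the potentials with a negative sign.
```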

Feature Functions and Marginal Probabilities
Pairwise Constraints
Data and Results
Marginal Distributions
Word Distributions
Values of Lagrange Multipliers and Potentials
Discussion
