Abstract

The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and the language sciences more generally. Information theory gives us the tools to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings. First, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Second, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across the languages of the world.
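
To make the two quantities concrete, here is a minimal Python sketch that computes a plug-in (maximum-likelihood) unigram word entropy and a bigram conditional entropy as a crude proxy for the entropy rate, then reports their difference. This illustrates the definitions only; it is not the estimation procedure used in the study, which addresses text-size dependence and longer-range dependencies. The file name corpus.txt and whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Plug-in estimate of H(W) = -sum_w p(w) * log2 p(w), in bits per word."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bigram_conditional_entropy(tokens):
    """Conditional entropy H(W_i | W_{i-1}) -- a crude stand-in for the
    entropy rate, which in principle conditions on all preceding co-text."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    n = sum(bigrams.values())
    return -sum((c / n) * math.log2(c / contexts[w1])
                for (w1, _w2), c in bigrams.items())

# "corpus.txt" is a hypothetical whitespace-tokenized text file.
tokens = open("corpus.txt", encoding="utf-8").read().split()
h_unigram = unigram_entropy(tokens)
h_cond = bigram_conditional_entropy(tokens)
print(f"unigram entropy:        {h_unigram:.2f} bits/word")
print(f"bigram cond. entropy:   {h_cond:.2f} bits/word")
print(f"reduction from co-text: {h_unigram - h_cond:.2f} bits/word")
```

The reported difference corresponds to the roughly three bits/word reduction described in the abstract, though a bigram model captures only part of the co-textual information a full entropy-rate estimator would use.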

Highlights

  • Symbols are the building blocks of information

  • We argue that word entropies across the languages of the world reflect the trade-off between two basic pressures on natural communication systems: word learnability vs. word expressivity

  • Entropy stabilization throughout the text sequence: for both unigram entropies and entropy rates, the stabilization criterion (SD < 0.1) is met at 50 K tokens for the 21 languages of the European Parliament Corpus (EPC); a sketch of such a check follows this list
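
The sketch below illustrates one way such a stabilization check could be implemented: the unigram entropy is re-estimated on growing prefixes of a text, and the prefix size at which the standard deviation of the most recent estimates drops below 0.1 bits/word is reported. The step size, window length and file name are hypothetical; the exact windowing used in the paper may differ.

```python
import math
import statistics
from collections import Counter

def unigram_entropy(tokens):
    """Plug-in estimate of the unigram word entropy, in bits per word."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def stabilization_point(tokens, step=10_000, window=5, threshold=0.1):
    """Prefix size (in tokens) at which the SD of the last `window`
    entropy estimates falls below `threshold` bits/word, or None."""
    estimates = []
    for end in range(step, len(tokens) + 1, step):
        estimates.append(unigram_entropy(tokens[:end]))
        if len(estimates) >= window and statistics.stdev(estimates[-window:]) < threshold:
            return end
    return None  # did not stabilize within this text

# "epc_en.txt" is a hypothetical EPC text; tokenization is plain whitespace.
tokens = open("epc_en.txt", encoding="utf-8").read().split()
print("stabilized at", stabilization_point(tokens), "tokens")
```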

Introduction

Symbols are the building blocks of information. They give rise to surprisal and uncertainty as a consequence of choice. This is the fundamental concept underlying information encoding. Natural languages are communicative systems harnessing this information-encoding potential. The average amount of information a word can carry is a basic property of a language, an information-theoretic fingerprint that reflects its idiosyncrasies and sets it apart from the other languages of the world.
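
For reference, the link between choice, surprisal and average information per word can be stated with the standard Shannon definitions (textbook formulas, not equations quoted from the paper):

```latex
% Surprisal of a word w under the unigram distribution p (in bits)
S(w) = -\log_2 p(w)

% Unigram word entropy: the expected surprisal over the vocabulary V
H(W) = -\sum_{w \in V} p(w) \log_2 p(w)

% Entropy rate: average per-word uncertainty once the preceding
% co-text is taken into account
h = \lim_{n \to \infty} \frac{1}{n} H(W_1, W_2, \dots, W_n)
```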
