Abstract

As we discuss, a stationary stochastic process is nonergodic when a random persistent topic can be detected in the infinite random text sampled from the process, whereas we call the process strongly nonergodic when an infinite sequence of independent random bits, called probabilistic facts, is needed to describe this topic completely. Replacing probabilistic facts with an algorithmically random sequence of bits, called algorithmic facts, we adapt this property back to ergodic processes. Subsequently, we call a process perigraphic if the number of algorithmic facts which can be inferred from a finite text sampled from the process grows like a power of the text length. We present a simple example of such a process. Moreover, we demonstrate an assertion which we call the theorem about facts and words. This proposition states that the number of probabilistic or algorithmic facts which can be inferred from a text drawn from a process must be roughly smaller than the number of distinct word-like strings detected in this text by means of the Prediction by Partial Matching (PPM) compression algorithm. We also observe that the number of word-like strings for a sample of plays by Shakespeare follows an empirical stepwise power law, in stark contrast to Markov processes. Hence, we suppose that natural language, considered as a process, is not only non-Markov but also perigraphic.
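
To make the Shakespeare observation concrete: a Heaps-type power law says that the number V(n) of distinct strings in a text of length n behaves like C·n^β. Below is a minimal sketch of such a check in Python. It uses plain whitespace tokens as a crude stand-in for the PPM-detected word-like strings of the paper, and the file name hamlet.txt is a hypothetical placeholder for any long plain-text sample.

```python
import math

def heaps_exponent(tokens, num_points=20):
    """Estimate beta in V(n) ~ C * n**beta, where V(n) is the number
    of distinct tokens among the first n tokens, via a least-squares
    fit in log-log coordinates."""
    total = len(tokens)
    # Geometrically spaced prefix lengths at which to record V(n).
    checkpoints = sorted({max(2, int(total ** (i / num_points)))
                          for i in range(1, num_points + 1)})
    xs, ys, seen = [], [], set()
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if checkpoints and n == checkpoints[0]:
            checkpoints.pop(0)
            xs.append(math.log(n))
            ys.append(math.log(len(seen)))
    # Ordinary least-squares slope of log V(n) against log n.
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# hamlet.txt is a placeholder for any long plain-text sample.
with open("hamlet.txt", encoding="utf-8") as f:
    tokens = f.read().split()
print(f"estimated Heaps exponent: {heaps_exponent(tokens):.2f}")
```

The geometrically spaced checkpoints keep the fit from being dominated by large n; a stepwise power law, as reported for Shakespeare, would appear as a roughly straight staircase in these log-log coordinates.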

Highlights

  • One of the motivating assumptions of information theory [1,2,3] is that communication in natural language can be reasonably modeled as a discrete stationary stochastic process, namely, an infinite sequence of discrete random variables with a well-defined time-invariant probability distribution

  • We will call a process perigraphic if the number of algorithmic facts which can be inferred from a finite text sampled from the process grows like a power of the text length (made precise in the sketch after this list)

  • A stationary process has been called strongly nonergodic if some persistent random topic can be detected in the process and an infinite number of independent binary random variables, called probabilistic facts, is needed to describe this topic completely
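
One common way to make "grows like a power of the text length" precise is through a Hilberg-type exponent, which the paper treats in its section on Hilberg exponents. A minimal sketch in LaTeX, where U(n) is our placeholder symbol for the number of algorithmic facts inferable from a text of length n (the paper's exact notation may differ):

```latex
% Hilberg exponent of a nonnegative sequence s(n):
\[
  \operatorname{hilberg}_{n \to \infty} s(n)
  \;:=\; \limsup_{n \to \infty} \frac{\log^{+} s(n)}{\log n},
  \qquad \log^{+} x := \max(\log x, 0).
\]
% A process is then perigraphic when this exponent is strictly
% positive for the expected number of inferable algorithmic facts:
\[
  \operatorname{hilberg}_{n \to \infty} \mathbf{E}\, U(n) \;>\; 0.
\]
```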

Introduction

One of the motivating assumptions of information theory [1,2,3] is that communication in natural language can be reasonably modeled as a discrete stationary stochastic process, namely, an infinite sequence of discrete random variables with a well-defined time-invariant probability distribution. In [19], it was heuristically shown that, if Hilberg's hypothesis for mutual information is satisfied by an arbitrary stationary stochastic process, then texts drawn from this process satisfy a kind of Heaps' law if we detect the words using grammar-based codes [20,21,22,23]. This result is a historical antecedent of the theorem about facts and words. We present two cases of the theorem: one for strongly nonergodic processes, applying Shannon information theory, and one for general stationary processes, applying algorithmic information theory. Having these results, we supplement them with a rudimentary discussion of some empirical data. In Appendix C, we show that the number of inferable facts for the Santa Fe processes follows a power law.
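
To see where such a power law can come from, here is a minimal sketch of a Santa Fe process, following the standard construction in the literature: the i-th symbol is a pair (K_i, Z_{K_i}), where the indices K_i are IID Zipf-distributed and (Z_k) is a fixed sequence of fair coin flips shared by the whole text. Each distinct index observed reveals one bit of the persistent topic, so counting distinct indices is a sketch-level proxy for counting inferable facts. The helper name santa_fe_facts and the values beta=0.5 and seed=0 are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def santa_fe_facts(n, beta=0.5, seed=0):
    """Sample n symbols of a Santa Fe process and track inferable facts.

    X_i = (K_i, Z_{K_i}): the indices K_i are IID with
    P(K_i = k) proportional to k**(-1/beta), 0 < beta < 1, and the
    bits Z_k are IID fair coin flips fixed for the whole text.
    Returns, for each prefix length i, the number of distinct
    indices seen so far, i.e. the number of revealed topic bits.
    """
    rng = np.random.default_rng(seed)
    ks = rng.zipf(1.0 / beta, size=n)   # indices K_i (Zipf exponent 1/beta)
    z = {}                              # lazily revealed topic bits Z_k
    facts = np.empty(n, dtype=int)
    for i, k in enumerate(ks):
        if k not in z:
            z[k] = rng.integers(2)      # reveal the bit Z_k exactly once
        facts[i] = len(z)
    return facts

facts = santa_fe_facts(100_000, beta=0.5)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, facts[n - 1])              # grows roughly like n**beta
```

On a log-log plot, the printed counts should lie close to a line of slope beta, which is the power-law growth of inferable facts that the perigraphic property requires.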

Ergodic and Nonergodic Processes
Strongly Nonergodic Processes
Perigraphic Processes
Theorem about Facts and Words
Hilberg Exponents and Empirical Data
Conclusions