Abstract
We show that the mutual information between two symbols, as a function of the number of symbols between the two, decays exponentially in any probabilistic regular grammar, but can decay like a power law for a context-free grammar. This result about formal languages is closely related to a well-known result in classical statistical mechanics that there are no phase transitions in dimensions fewer than two. It is also related to the emergence of power law correlations in turbulence and cosmological inflation through recursive generative processes. We elucidate these physics connections and comment on potential applications of our results to machine learning tasks like training artificial recurrent neural networks. Along the way, we introduce a useful quantity, which we dub the rational mutual information, and discuss generalizations of our claims involving more complicated Bayesian networks.
Highlights
Critical behavior, where long-range correlations decay as a power law with distance, has many important physics applications, ranging from phase transitions in condensed-matter experiments to turbulence and inflationary fluctuations in our early Universe.
As discussed in previous works [9,11,13], the plot shows that the number of bits of information provided by one symbol about another drops roughly as a power law with separation.
Just how generic is the scaling behavior of our model? What if the length of the words is not constant? What about more complex dependencies between layers? If we retrace the derivation in the arguments above, it becomes clear that the only key feature of all of the models considered so far is that the rational mutual information decays exponentially with the causal distance ∆: I_R ∼ e^{−γ∆}.
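The exponential decay can be checked directly for an ordinary Markov chain. The sketch below is a minimal illustration, not code from the paper: it uses the standard Shannon mutual information rather than the rational mutual information I_R, and the two-state transition matrix is an arbitrary example. It computes I(X_0; X_d) exactly from the joint distribution diag(π)·Tᵈ, and the per-step decay factor approaches λ₂², the square of the second eigenvalue of T.

```python
import numpy as np

def mutual_information(T, d):
    """Exact mutual information (in nats) between X_0 and X_d of a
    stationary Markov chain with transition matrix T."""
    # Stationary distribution: left eigenvector of T for eigenvalue 1.
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    # Joint distribution P(X_0 = i, X_d = j) = pi_i * (T^d)_{ij}.
    joint = np.diag(pi) @ np.linalg.matrix_power(T, d)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / np.outer(px, py)[mask])).sum())

# Arbitrary two-state example; second eigenvalue is 0.7, so
# asymptotically I(d+1)/I(d) -> 0.7^2 = 0.49: exponential decay.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
for d in (1, 2, 4, 8):
    print(d, mutual_information(T, d))
```

For any regular grammar the same mechanism applies: the hidden state is Markovian, so correlations are bounded by powers of a subleading eigenvalue and the mutual information plummets exponentially.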
Summary
Critical behavior, where long-range correlations decay as a power law with distance, has many important physics applications, ranging from phase transitions in condensed-matter experiments to turbulence and inflationary fluctuations in our early Universe. All measured curves are seen to decay roughly as power laws, explaining why they cannot be accurately modeled as Markov processes, for which the mutual information instead plummets exponentially (the example shown has I ∝ e^{−d/6}). We will show that computer descriptions of language suffer from a much simpler problem that involves no talk of meaning or being non-human: they tend to get the basic statistical properties wrong. To illustrate this point, consider Markov models of natural language. Linguistic arguments typically do not produce an observable that can be used to quantitatively falsify any Markovian model of language. Instead, these arguments rely on highly specific knowledge about the data, in this case, an understanding of the language's grammar.
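By contrast, the recursive generative processes discussed above naturally produce power-law decay. A minimal sketch, using a toy hierarchical model chosen here for illustration (a perfect binary tree in which each symbol copies its parent with flip probability eps, so two leaves at distance d communicate only through their lowest common ancestor): since doubling the separation adds only a fixed number of tree edges, the exact mutual information falls off as a power of d rather than exponentially.

```python
from math import floor, log, log2

def tree_mutual_information(d, eps=0.1):
    """Exact MI (in nats) between leaf 0 and leaf d of a binary
    'copy-with-flip-probability-eps' tree: the path between them
    has 2*h edges, where h is the height of their lowest common
    ancestor."""
    h = floor(log2(d)) + 1            # LCA height for leaves 0 and d
    rho = (1 - 2 * eps) ** (2 * h)    # correlation after 2*h symmetric channels
    # MI of two uniform binary variables with P(agree) = (1 + rho) / 2:
    # I = H(X) - H(X|Y) = ln 2 - H_binary((1 + rho) / 2).
    mi = log(2)
    for p in ((1 + rho) / 2, (1 - rho) / 2):
        mi += p * log(p)
    return mi

# Doubling d multiplies the MI by a roughly constant factor,
# i.e. MI(d) ~ d^(-gamma): a power law, not an exponential.
for d in (2, 4, 8, 16):
    print(d, tree_mutual_information(d))
```

The design point is that depth, not sequence position, sets the correlation length: a context-free grammar's parse tree plays the role of this hierarchy, which is why it can evade the exponential bound that constrains any Markov (regular-grammar) process.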