Abstract

We present a hypothetical argument against finite-state processes in statistical language modeling that is based on semantics rather than syntax. In this theoretical model, we suppose that the semantic properties of texts in a natural language could be approximately captured by a recently introduced concept of a perigraphic process. Perigraphic processes are a class of stochastic processes that satisfy a Zipf-law accumulation of a subset of factual knowledge, which is time-independent, compressed, and effectively inferrable from the process. We show that the classes of finite-state processes and of perigraphic processes are disjoint, and we present a new simple example of perigraphic processes over a finite alphabet called Oracle processes. The disjointness result makes use of the Hilberg condition, i.e., the almost sure power-law growth of algorithmic mutual information. Using a strongly consistent estimator of the number of hidden states, we show that finite-state processes do not satisfy the Hilberg condition whereas Oracle processes satisfy the Hilberg condition via the data-processing inequality. We discuss the relevance of these mathematical results for theoretical and computational linguistics.
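For orientation, the Hilberg condition named in the abstract can be sketched in symbols as follows. The notation is an assumption made for illustration and is not quoted from the paper: $K(\cdot)$ stands for prefix-free Kolmogorov complexity, $J(u;w)$ for algorithmic mutual information, $X_{1:n}$ for the first $n$ symbols of the process, and $\beta$ for the power-law exponent. The paper's exact formulation (for example, whether the growth is stated as a bound or through a limiting exponent) may differ.

```latex
% Sketch only: symbols K, J, X_{1:n}, and \beta are assumed notation.
\begin{align*}
  J(u; w) &= K(u) + K(w) - K(u, w), \\
  J(X_{1:n}; X_{n+1:2n}) &\text{ grows like } n^{\beta}
    \quad \text{almost surely, for some } \beta \in (0, 1).
\end{align*}
```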

Highlights

  • The goal of this article is to show that finite-state statistical language models can be refuted using a hypothetical argument that is based on semantics rather than syntax

  • Statistical modeling of texts, deceptively called statistical language modeling, requires that we model both language as a system and the things that are expressed in language, and these seem to come as a large number of rare events [21,22]

  • We stated the Hilberg hypothesis in terms of algorithmic mutual information and we showed that no finite-state statistical language model is compatible even with such a generalized hypothesis, whereas there exist simple perigraphic processes, called Santa Fe and Oracle processes, which are fully compatible with Hilberg’s hypothesis


Summary

Introduction

The goal of this article is to show that finite-state statistical language models can be refuted using a hypothetical argument that is based on semantics rather than syntax. Perigraphic processes define a model of factual knowledge that is infinite, time-independent, compressed losslessly as much as possible, and effectively described in random texts at a power-law rate. For any stationary stream of data that does not satisfy a power-law growth of computable estimates of algorithmic mutual information, we can effectively tell that it is not perigraphic. In this way, we proceed to another important topic, namely, Hilberg’s hypothesis. The most important achievements of our mathematical theory of Hilberg’s hypothesis are the so-called theorems about facts and words, cf. [27,29,54] and [36] (Section 8.4), that connect this hypothesis with Zipfian power laws for words and for bits of the compressed factual knowledge called facts. We suppose that perigraphic processes are a promising class of abstract statistical language models in which linguistically interpretable properties can be investigated deductively and partly motivated empirically.
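The idea of a computable estimate of algorithmic mutual information can be illustrated with a rough sketch. The code below is not the estimator used in the article: it swaps in zlib compressed length as a crude, hypothetical stand-in for Kolmogorov complexity, and the helper names (`compressed_len`, `mi_estimate`, `two_state_markov`) are invented for illustration. For a finite-state source such as the two-state Markov chain generated here, one would expect the estimate between adjacent blocks to stay small relative to the block length, rather than grow as a power law.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


def mi_estimate(sample: bytes, n: int) -> int:
    """Compression-based proxy for mutual information between two adjacent
    blocks of length n: C(left) + C(right) - C(left + right)."""
    left, right = sample[:n], sample[n:2 * n]
    return compressed_len(left) + compressed_len(right) - compressed_len(left + right)


def two_state_markov(length: int, stay: float = 0.9, seed: int = 0) -> bytes:
    """Sample a two-state Markov chain over the symbols 'a' and 'b'.
    The chain keeps its current state with probability `stay`."""
    rng = random.Random(seed)
    state = 0
    out = bytearray()
    for _ in range(length):
        if rng.random() > stay:
            state = 1 - state
        out.append(ord("a") + state)
    return bytes(out)


if __name__ == "__main__":
    sample = two_state_markov(2 ** 14)
    # Print the proxy for doubling block lengths n.
    for k in range(8, 14):
        n = 2 ** k
        print(n, mi_estimate(sample, n))
```

Because zlib uses a 32 KB window and adds fixed header overhead, the numbers are only indicative for small block lengths; a stronger universal code would be needed for anything beyond a qualitative check.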

Aims and Organization of the Article
Some Classes of Processes
Hilberg’s Hypothesis
Finite-State Processes
Perigraphic Processes
Oracle Processes
Discussion
Conclusions