Abstract

Entropy estimation faces numerous challenges when applied to real-world problems. Our interest is in divergence and entropy estimation algorithms capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach based on a rank-based analytic Zipf–Mandelbrot–Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language, we introduce a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicate the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistic constraint-based Zipf–Mandelbrot–Li model into a new non-equiprobable coincidence counting algorithm, which is shown to be effective for tasks such as entropy rate estimation with limited data.
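As a rough, self-contained illustration of the kind of rank-based model the abstract refers to, the sketch below builds a generic Zipf–Mandelbrot rank–probability distribution p(r) ∝ 1/(r + q)^s and computes its Shannon entropy. The vocabulary size and the parameters s and q are illustrative assumptions only; this is not the analytically derived, linguistically constrained ZML model proposed in the paper.

```python
import numpy as np

def zipf_mandelbrot_probs(vocab_size, s=1.0, q=2.7):
    """Generic Zipf-Mandelbrot rank probabilities p(r) proportional to 1/(r + q)^s.

    NOTE: s and q here are illustrative guesses; in the Zipf-Mandelbrot-Li
    setting they would be derived analytically (subject to linguistic
    constraints) rather than chosen by hand.
    """
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / (ranks + q) ** s
    return weights / weights.sum()

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

if __name__ == "__main__":
    p = zipf_mandelbrot_probs(vocab_size=10_000)
    print(f"Entropy of the illustrative rank model: {entropy_bits(p):.3f} bits/symbol")
```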

Highlights

  • Natural systems such as language can be understood in terms of symbolic sequences described within an information-theoretic framework, where meaning is encoded through the arrangement of probabilistic elements

  • “u”; (c) “s” never follows “x”; and (d) words never end in “v” or “j”. These could potentially be considered as priors in a model, and there are other aspects of interaction in human communication and natural languages beyond linguistics which could be considered as statistical principles to include in a model (for convenience, we refer to these broadly as linguistic constraints and note that they may be related to verbal or written language)

  • A word trigram method, which used the cross-entropy between a language model trained on 583 million symbols and a balanced sample of English text, was applied to Form C of the Brown corpus and yielded an upper-bound entropy rate estimate of 1.75 bpc [74]
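For context on the highlight above: cross-entropy measured with any language model upper-bounds (in expectation) the true entropy rate of the source, since the average of −log2 of the model's predicted probabilities on held-out text cannot fall below the source entropy rate. The sketch below is not the word trigram model of [74]; it is a minimal character bigram stand-in with add-one smoothing and placeholder training/test strings, shown only to make concrete how a per-character cross-entropy in bpc is computed.

```python
import math
from collections import Counter

def train_char_bigram(text, alphabet):
    """Character bigram model with add-one smoothing.

    A stand-in for the word trigram model of [74]; the training text used
    below is a placeholder, not a balanced English corpus.
    """
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    vocab = len(alphabet)

    def prob(next_char, context_char):
        return (pair_counts[(context_char, next_char)] + 1) / (context_counts[context_char] + vocab)

    return prob

def cross_entropy_bpc(model, text):
    """Average -log2(model probability) per character on held-out text,
    which upper-bounds the source entropy rate in expectation."""
    costs = [-math.log2(model(nxt, ctx)) for ctx, nxt in zip(text, text[1:])]
    return sum(costs) / len(costs)

if __name__ == "__main__":
    train = "the quick brown fox jumps over the lazy dog " * 50   # placeholder corpus
    test = "the lazy dog naps while the quick fox runs away "     # placeholder held-out text
    alphabet = sorted(set(train) | set(test))
    model = train_char_bigram(train, alphabet)
    print(f"Cross-entropy upper bound: {cross_entropy_bpc(model, test):.2f} bpc")
```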


Summary

Introduction

When placed in a mathematical framework, we can characterize and begin to understand the meaning of messages not on the basis of the meaning directly attached to words, but on the statistical characteristics of symbols. Using this approach, natural language can be viewed as observing one or more discrete random variables of a sequence X = X1, X2, …, Xn. The model “hint” that we introduce is the idea that for many natural sequences, including language, instead of a naive estimator, the probabilistic distribution of symbols is expected to follow linguistic patterns. The basis for our proposed approach is to develop an analytic rank-based Zipfian-style probabilistic model which is constrained to accommodate the linguistic features of human language, and to incorporate this into an efficient non-equiprobable coincidence counting entropy estimation algorithm.
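To make the coincidence counting idea concrete, here is a minimal sketch of the simplest version of the technique: estimate the collision probability from the fraction of coinciding pairs in a sample, and convert it to a collision (Rényi order-2) entropy, which never exceeds the Shannon entropy. The sample data and alphabet are purely illustrative, and the paper's actual contribution, the non-equiprobable correction based on the constrained ZML distribution, is not reproduced here.

```python
import math
import random
from itertools import combinations

def collision_entropy_bits(sample):
    """Coincidence-counting estimate of the collision (Renyi order-2) entropy.

    The fraction of coinciding pairs estimates P(collision) = sum_i p_i^2,
    and H2 = -log2(P(collision)) never exceeds the Shannon entropy, so this
    is a conservative, small-sample estimate.
    """
    pairs = list(combinations(sample, 2))
    coincidences = sum(1 for a, b in pairs if a == b)
    return -math.log2(coincidences / len(pairs))

if __name__ == "__main__":
    # Illustrative skewed alphabet (not real language data).
    random.seed(0)
    symbols = list("etaonshrdl")
    weights = [12, 9, 8, 7, 6, 4, 3, 2, 2, 1]
    sample = random.choices(symbols, weights=weights, k=200)
    print(f"Collision entropy estimate: {collision_entropy_bits(sample):.2f} bits")
```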

Coincidence Counting Approach
Linguistic Entropy Estimation
Remarks on Bias and Convergence Properties
Limitations of Zipfian Models for Language
Unconstrained Rank-Ordered Probabilistic Model
Constrained Linguistic Probabilistic Model
Constrained Linguistic Probabilistic Model II
Constrained Linguistic ZML Model for Natural Language
Entropy Rate Estimation
Convergence of Constrained cZML Entropy Estimation Algorithm
Conclusions