Abstract

Entropy estimation faces numerous challenges when applied to real-world problems. Our interest is in divergence and entropy estimation algorithms capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach based on a rank-based analytic Zipf–Mandelbrot–Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language, we introduce a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicate the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistic constraint-based Zipf–Mandelbrot–Li model into a new non-equiprobable coincidence counting algorithm, which is shown to be effective for tasks such as entropy rate estimation with limited data.
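As a rough, self-contained illustration of the kind of rank-based model the abstract refers to, the sketch below builds a generic Zipf–Mandelbrot rank–probability distribution p(r) ∝ 1/(r + q)^s and computes its Shannon entropy. The vocabulary size and the parameters s and q are illustrative assumptions only; this is not the analytically derived, linguistically constrained ZML model proposed in the paper.

```python
import numpy as np

def zipf_mandelbrot_probs(vocab_size, s=1.0, q=2.7):
    """Generic Zipf-Mandelbrot rank probabilities p(r) proportional to 1/(r + q)^s.

    NOTE: s and q here are illustrative guesses; in the Zipf-Mandelbrot-Li
    setting they would be derived analytically (subject to linguistic
    constraints) rather than chosen by hand.
    """
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / (ranks + q) ** s
    return weights / weights.sum()

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

if __name__ == "__main__":
    p = zipf_mandelbrot_probs(vocab_size=10_000)
    print(f"Entropy of the illustrative rank model: {entropy_bits(p):.3f} bits/symbol")
```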

Highlights

  • Natural systems such as language can be understood in terms of symbolic sequences described within an information-theoretic framework, where meaning is encoded through the arrangement of probabilistic elements

  • “u”; (c) “s” never follows “x”; and (d) words never end in “v” or “j”. These could potentially be considered as priors in a model, and there are other aspects of interaction in human communication and natural languages beyond linguistics which could be considered as statistical principles to include in a model (for convenience, we refer to these broadly as linguistic constraints and note that they may be related to verbal or written language)

  • A word trigram method, which used the cross-entropy between a language model trained on 583 million symbols and a balanced sample of English text, was applied to Form C of the Brown corpus and yielded an upper-bound entropy rate estimate of 1.75 bpc [74]
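For context on the highlight above: cross-entropy measured with any language model upper-bounds (in expectation) the true entropy rate of the source, since the average of −log2 of the model's predicted probabilities on held-out text cannot fall below the source entropy rate. The sketch below is not the word trigram model of [74]; it is a minimal character bigram stand-in with add-one smoothing and placeholder training/test strings, shown only to make concrete how a per-character cross-entropy in bpc is computed.

```python
import math
from collections import Counter

def train_char_bigram(text, alphabet):
    """Character bigram model with add-one smoothing.

    A stand-in for the word trigram model of [74]; the training text used
    below is a placeholder, not a balanced English corpus.
    """
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    vocab = len(alphabet)

    def prob(next_char, context_char):
        return (pair_counts[(context_char, next_char)] + 1) / (context_counts[context_char] + vocab)

    return prob

def cross_entropy_bpc(model, text):
    """Average -log2(model probability) per character on held-out text,
    which upper-bounds the source entropy rate in expectation."""
    costs = [-math.log2(model(nxt, ctx)) for ctx, nxt in zip(text, text[1:])]
    return sum(costs) / len(costs)

if __name__ == "__main__":
    train = "the quick brown fox jumps over the lazy dog " * 50   # placeholder corpus
    test = "the lazy dog naps while the quick fox runs away "     # placeholder held-out text
    alphabet = sorted(set(train) | set(test))
    model = train_char_bigram(train, alphabet)
    print(f"Cross-entropy upper bound: {cross_entropy_bpc(model, test):.2f} bpc")
```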


Summary

Introduction

When placed in a mathematical framework, we can characterize and begin to understand the meaning of messages not on the basis of the meaning directly attached to words, but on the statistical characteristics of symbols. Using this approach, natural language can be viewed as observing one or more discrete random variables of a sequence X = X1, X2, …, Xn. The model “hint” that we introduce is the idea that for many natural sequences, including language, instead of a naive estimator, the probabilistic distribution of symbols is expected to follow linguistic patterns. The basis for our proposed approach is to develop an analytic rank-based Zipfian-style probabilistic model which is constrained to accommodate the linguistic features of human language, and to incorporate this into an efficient non-equiprobable coincidence counting entropy estimation algorithm.
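To make the coincidence counting idea concrete, here is a minimal sketch of the simplest version of the technique: estimate the collision probability from the fraction of coinciding pairs in a sample, and convert it to a collision (Rényi order-2) entropy, which never exceeds the Shannon entropy. The sample data and alphabet are purely illustrative, and the paper's actual contribution, the non-equiprobable correction based on the constrained ZML distribution, is not reproduced here.

```python
import math
import random
from itertools import combinations

def collision_entropy_bits(sample):
    """Coincidence-counting estimate of the collision (Renyi order-2) entropy.

    The fraction of coinciding pairs estimates P(collision) = sum_i p_i^2,
    and H2 = -log2(P(collision)) never exceeds the Shannon entropy, so this
    is a conservative, small-sample estimate.
    """
    pairs = list(combinations(sample, 2))
    coincidences = sum(1 for a, b in pairs if a == b)
    return -math.log2(coincidences / len(pairs))

if __name__ == "__main__":
    # Illustrative skewed alphabet (not real language data).
    random.seed(0)
    symbols = list("etaonshrdl")
    weights = [12, 9, 8, 7, 6, 4, 3, 2, 2, 1]
    sample = random.choices(symbols, weights=weights, k=200)
    print(f"Collision entropy estimate: {collision_entropy_bits(sample):.2f} bits")
```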

Coincidence Counting Approach
Linguistic Entropy Estimation
Remarks on Bias and Convergence Properties
Limitations of Zipfian Models for Language
Unconstrained Rank-Ordered Probabilistic Model
Constrained Linguistic Probabilistic Model
Constrained Linguistic Probabilistic Model II
Constrained Linguistic ZML Model for Natural Language
Entropy Rate Estimation
Convergence of Constrained cZML Entropy Estimation Algorithm
Conclusions