Abstract

The Menzerath law is considered to show an aspect of the complexity underlying natural language. This law suggests that, for a linguistic unit, the size (y) of a linguistic construct decreases as the number (x) of constructs in the unit increases. This article investigates this property syntactically, with x as the number of constituents modifying the main predicate of a sentence and y as the size of those constituents in terms of the number of words. Following previous articles that demonstrated that the Menzerath property held for dependency corpora, such as in Czech and Ukrainian, this article first examines how well the property applies across languages by using the entire Universal Dependency dataset ver. 2.3, including 76 languages over 129 corpora and the Penn Treebank (PTB). The results show that the law holds reasonably well for . Then, for comparison, the property is investigated with syntactically randomized sentences generated from the PTB. These results show that the property is almost reproducible even from simple random data. Further analysis of the property highlights more detailed characteristics of natural language.

Highlights

  • The theme of this article is the Menzerath law of syntactic structure, which has been considered to demonstrate some of the complexity underlying natural language

  • The left graph is for random sentences, while the right graph is for the Penn Treebank (PTB)

  • Given that natural language has a similar difference in the ranges of x and sentence lengths, i.e., that it lacks shorter sentences, we cannot completely deny that the Menzerath property of natural language is partly produced by a similar statistical effect

Summary

Introduction

The theme of this article is the Menzerath law of syntactic structure, which has been considered to demonstrate some of the complexity underlying natural language. One method to study the reasons for such a phenomenon is to formulate it mathematically. The first such functional formulation was proposed by Altmann [5], and the property is often called the Menzerath–Altmann law. There have been indications that the Menzerath property holds for syntactic structures in language [21,22,23,24]. Those authors suggested measuring the mean size of the main constituents of a sentence (y) with respect to the number of main constituents (x). The papers showed that the Menzerath property held for the authors’ respective mother tongues, but it is unknown how well their findings apply to other languages. This article applies a new idea: using random dependency sentences generated from the PTB. Such randomized analysis makes it possible to consider how natural language text differs from random data. The detailed analysis clearly shows some aspects of natural language that differ from those of random data, which suggests that further study of the Menzerath property could lead to a better understanding of natural language.
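To make the measurement concrete, the following is a minimal Python sketch of how x and y could be computed from dependency-annotated sentences. It is an illustration under stated assumptions, not the article's implementation: each sentence is assumed to be given as a list of 1-based head indices in the CoNLL-U style (0 marks the root, taken here as the main predicate), every token including punctuation is counted as a word, and the function names `constituent_sizes`, `menzerath_point`, and `mean_size_by_count` are hypothetical.

```python
from collections import defaultdict


def constituent_sizes(heads):
    """Return the sizes (in words) of the subtrees hanging off the root.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    Assumption: the root token stands in for the main predicate.
    """
    children = defaultdict(list)
    root = None
    for token, head in enumerate(heads, start=1):
        if head == 0:
            root = token
        else:
            children[head].append(token)

    def subtree_size(node):
        # A constituent's size is the node itself plus all of its descendants.
        return 1 + sum(subtree_size(child) for child in children[node])

    return [subtree_size(child) for child in children[root]]


def menzerath_point(heads):
    """Return (x, y) for one sentence, or None if the root has no dependents."""
    sizes = constituent_sizes(heads)
    if not sizes:
        return None
    x = len(sizes)          # number of constituents modifying the root
    y = sum(sizes) / x      # mean constituent size in words
    return x, y


def mean_size_by_count(sentences):
    """Average y over all sentences that share the same x."""
    grouped = defaultdict(list)
    for heads in sentences:
        point = menzerath_point(heads)
        if point is not None:
            x, y = point
            grouped[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in grouped.items()}


# "The dog chased a cat": The->dog, dog->chased, chased=root, a->cat, cat->chased
example = [2, 3, 0, 5, 3]
print(menzerath_point(example))
```

For the example sentence the root "chased" has two dependents ("dog" and "cat"), each heading a two-word constituent. Plotting the output of `mean_size_by_count` against x for a whole corpus gives the kind of curve on which the Menzerath property is assessed; whether punctuation or function words should be excluded is a design choice this sketch does not settle.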

Formulation of the Property
Dependency Structure
Menzerath Property of Syntactically Annotated Data
Universal Dependency Dataset
Penn Treebank
Menzerath Property of Random Sentences
Generation of Random Dependency Sentences
Empirical Menzerath Property of Random Sentences
Analytical Rationale
Findings
Discussion
Conclusions
