Abstract

Physical manifestations of linguistic units include sources of variability due to factors of speech production which are by definition excluded from counts of linguistic symbols. In this work, we examine whether linguistic laws hold with respect to the physical manifestations of linguistic units in spoken English. The data we analyse come from a phonetically transcribed database of acoustic recordings of spontaneous speech known as the Buckeye Speech corpus. First, we verify with unprecedented accuracy that acoustically transcribed durations of linguistic units at several scales comply with a lognormal distribution, and we quantitatively justify this ‘lognormality law’ using a stochastic generative model. Second, we explore the four classical linguistic laws (Zipf’s Law, Herdan’s Law, Brevity Law and Menzerath–Altmann’s Law (MAL)) in oral communication, both in physical units and in symbolic units measured in the speech transcriptions, and find that the validity of these laws is typically stronger when using physical units than in their symbolic counterpart. Additional results include (i) coining a Herdan’s Law in physical units; (ii) a precise mathematical formulation of Brevity Law, which we show to be connected to optimal compression principles in information theory and which allows us to formulate and validate yet another law, which we call the size-rank law; and (iii) a mathematical derivation of MAL, which also highlights an additional regime where the law is inverted. Altogether, these results support the hypothesis that statistical laws in language have a physical origin.

Highlights

  • Notable patterns which are nowadays widely recognized include Zipf’s Law, which addresses the rank-frequency plot of linguistic units; Herdan’s Law, on the sublinear vocabulary growth in a text; the Brevity Law, which highlights the tendency of more abundant linguistic units to be shorter; and the so-called Menzerath–Altmann Law (MAL), which points to a negative correlation between the size of a construct and the size of its constituents

Summary

Introduction

Physical manifestations of linguistic units include sources of variability due to factors of speech production which are by definition excluded from counts of linguistic symbols. A given word or sentence can be spoken in different ways, with different intonations, and its duration admits a certain variability [8] that could have semantic consequences [17]. These variations cannot, by construction, be explained using symbolic language representations, and one would not expect physical measures to follow the linguistic laws without an additional explanation. To address this issue, we have conducted a systematic exploration of linguistic laws in a large corpus of spoken English (the Buckeye corpus) [18,19] that has previously been manually segmented. This gives access to both (i) symbolic linguistic units (the transcription of phonemes, words and breath-groups (BG), the latter defined by pauses in the speech for breathing or longer) and (ii) the physical quantities attached to each of these units. Together, these allow a parallel exploration of statistical patterns of oral communication in both the actual physical signal and its text transcription.
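To make the idea of a parallel exploration concrete, the following minimal Python sketch computes the raw ingredients of three of the laws from a time-aligned transcription, once in symbolic units (characters, token counts) and once in physical units (seconds). It is an illustration only, not the authors' pipeline: the `tokens` list is invented toy data rather than Buckeye data, the function names are hypothetical, and using elapsed time as the physical text size for vocabulary growth is an assumption made here for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical time-aligned word tokens: (word, start_s, end_s).
# Toy values for illustration only; they are NOT taken from the Buckeye corpus.
tokens = [
    ("the", 0.00, 0.09),
    ("cat", 0.09, 0.38),
    ("saw", 0.38, 0.66),
    ("the", 0.66, 0.74),
    ("other", 0.74, 1.05),
    ("cat", 1.05, 1.31),
    ("and", 1.31, 1.45),
    ("the", 1.45, 1.55),
    ("dog", 1.55, 1.90),
]


def rank_frequency(tokens):
    """Zipf ingredient: word types sorted by descending frequency."""
    counts = Counter(word for word, _, _ in tokens)
    return counts.most_common()


def size_by_type(tokens):
    """Brevity ingredient: size of each word type in symbolic units
    (characters) and in physical units (mean duration in seconds)."""
    durations = {}
    for word, start, end in tokens:
        durations.setdefault(word, []).append(end - start)
    return {
        word: {"chars": len(word), "mean_duration_s": mean(values)}
        for word, values in durations.items()
    }


def vocabulary_growth(tokens):
    """Herdan ingredient: vocabulary size V against text size, measured
    both as a token count and as elapsed time (an assumed physical proxy)."""
    seen = set()
    growth = []
    for n_tokens, (word, _, end) in enumerate(tokens, start=1):
        seen.add(word)
        growth.append((n_tokens, end, len(seen)))
    return growth


if __name__ == "__main__":
    sizes = size_by_type(tokens)
    for rank, (word, freq) in enumerate(rank_frequency(tokens), start=1):
        print(rank, word, freq, sizes[word])
    print(vocabulary_growth(tokens)[-1])
```

On a real corpus, each of these tables would then be fitted or correlated (for example, frequency against rank on log-log axes) separately for the symbolic and the physical measure, so that the two versions of each law can be compared.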
