Abstract
Sentence formation is a highly structured, history-dependent, and sample-space reducing (SSR) process. While the first word in a sentence can be chosen from the entire vocabulary, the freedom to choose subsequent words typically becomes more and more constrained by grammar and context as the sentence progresses. This sample-space reducing property offers a natural explanation of Zipf's law in word frequencies; however, it fails to capture the structure of the word-to-word transition probability matrices of English text. Here we adopt the view that grammatical constraints (such as subject-predicate-object) locally re-order the words in sentences that are sampled by the word generation process. We demonstrate that superimposing grammatical structure, as a local word re-ordering (permutation) process, on a sample-space reducing word generation process is sufficient to explain both word frequencies and word-to-word transition probabilities. We compare the performance of the grammatically ordered SSR model in reproducing several test statistics of real texts with that of other text generation models, such as the Bernoulli model, the Simon model, and the random typewriting model.
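To make the generative picture concrete, the following is a minimal Python sketch of the two ingredients the abstract describes: an SSR sampler, in which each word's rank must be strictly below the previous one, and a superimposed local permutation. The fixed window size and uniform shuffle are illustrative assumptions of this sketch, not the paper's actual grammar-driven re-ordering rule.

```python
import random

def ssr_sentence(vocab_size, rng):
    """Sample one 'sentence' from a sample-space reducing (SSR) process:
    the first word is drawn uniformly from the whole vocabulary
    (ranks 1..vocab_size); each subsequent word must have a strictly
    lower rank, and the sentence ends when rank 1 is reached."""
    word = rng.randint(1, vocab_size)
    sentence = [word]
    while word > 1:
        word = rng.randint(1, word - 1)  # sample space shrinks at every step
        sentence.append(word)
    return sentence

def go_ssr_sentence(vocab_size, rng, window=3):
    """goSSR sketch: locally permute the SSR output within short windows,
    standing in for grammatical re-ordering (e.g. subject-predicate-object).
    The window length and uniform shuffle are assumptions of this sketch."""
    sentence = ssr_sentence(vocab_size, rng)
    for i in range(0, len(sentence), window):
        chunk = sentence[i:i + window]
        rng.shuffle(chunk)
        sentence[i:i + window] = chunk
    return sentence

# Generate a toy 'text' as a stream of goSSR sentences.
rng = random.Random(42)
text = [w for _ in range(10_000) for w in go_ssr_sentence(1000, rng)]
```

The local permutation leaves each word's overall frequency unchanged (so Zipf-like rank statistics are preserved) while altering which words follow which, which is the mechanism the abstract invokes to reshape the word-to-word transition matrix.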
Highlights
After almost a century of work, understanding statistical regularities in language is still work in progress
Across all the test statistics we considered, "grammatically ordered" SSR (goSSR) models show values that typically lie between those of real text and the Bernoulli model
Our results show that understanding the statistics of streams of English text as the result of a grammatical ordering process superimposed on a sample-space reducing (SSR) word selection process is consistent with the evidence
Summary
After almost a century of work, understanding statistical regularities in language is still work in progress. Perhaps the most striking statistical feature is that rank-ordered distributions of word frequencies follow an approximate power law,

$$f(r) \sim r^{-\alpha}, \qquad (1)$$

where r is the rank assigned to every word in a given text; the most frequent word has rank one, the second most frequent has rank two, and so on. For most word-based texts one finds α ≈ 1, independent of language, genre, and time of writing. This "universal" feature is called Zipf's law [1].
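For illustration, here is a small, self-contained Python sketch that computes the rank-frequency distribution of a word list and estimates the exponent α via a least-squares fit in log-log space. The fitting procedure is a rough assumption of this sketch, not the estimator used in the paper.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Rank-ordered frequencies: rank 1 is the most frequent word."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def estimate_alpha(rank_freq):
    """Least-squares slope of log f(r) versus log r; under Zipf's law
    the returned exponent should be close to 1."""
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# e.g. alpha = estimate_alpha(rank_frequency(text)) for a word list `text`
```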