Further Beyond Zipf's Law: Fit of a Quadratic Form to the Frequency-Rank Distribution for the Brown Corpus and Other Commonly Available English Language Corpora

Graham L Giller

doi:10.2139/ssrn.2228355

Journal: SSRN Electronic Journal

Publication Date: Mar 5, 2013

Abstract

A quadratic extension to the Zipf-Mandelbrot Law is fitted to frequency-rank plots for a series of commonly available English language corpora: specifically, the Brown, Reuters, Genesis, State of the Union and Movie Reviews corpora available in the Python Natural Language Toolkit (NLTK) package. In all cases, a quadratic form for the Zipfian frequency-rank relationship is chosen as the best approximating model to the data over the simpler Zipf and Zipf-Mandelbrot Laws. A naive pseudoword set is generated by a bootstrap method to estimate the sampling distributions of the estimated parameters, from which we may conclude that the observed curvature of the frequency-rank plot is not consistent with an unordered random symbol concatenation method as described by Li. A relationship is observed between the lexical diversity of the corpora, as estimated by the Guiraud Index of the relative frequency of types and tokens, and the observed curvature of the frequency-rank plot, although for the number of corpora studied here the relationship is not statistically significant.
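The kind of fit described in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure: the abstract does not give the exact parameterisation, so the code below assumes a plain polynomial in log rank (degree 1 for Zipf, degree 2 for the quadratic form), omits the Mandelbrot rank shift, and uses ordinary least squares. The function names `frequency_rank`, `fit_log_polynomial` and `guiraud_index` are hypothetical. The Guiraud Index is computed in its usual form, the number of types divided by the square root of the number of tokens.

```python
from collections import Counter
import math

import numpy as np


def frequency_rank(tokens):
    """Return (ranks, frequencies), with frequencies sorted in descending order."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    ranks = np.arange(1, len(counts) + 1, dtype=float)
    return ranks, np.array(counts, dtype=float)


def fit_log_polynomial(ranks, freqs, degree):
    """Least-squares fit of log(frequency) as a polynomial in log(rank).

    ASSUMPTION: degree=1 stands in for Zipf's Law and degree=2 for the
    quadratic form; the Mandelbrot shift of the rank is not modelled.
    Returns the coefficients (highest power first) and the residual
    sum of squares.
    """
    x = np.log(ranks)
    y = np.log(freqs)
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(np.sum(residuals ** 2))


def guiraud_index(tokens):
    """Guiraud Index: number of types divided by sqrt(number of tokens)."""
    return len(set(tokens)) / math.sqrt(len(tokens))
```

To try this on one of the corpora named above, a token list such as `nltk.corpus.brown.words()` can be passed in once the corpus has been downloaded. The leading coefficient of the degree-2 fit measures the curvature of the log-log plot. Because the two models are nested, the quadratic fit never has a larger residual sum of squares than the linear one, so choosing between them requires a criterion that penalises the extra parameter; the abstract does not say which criterion the paper uses.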

Full Text