Abstract

A quadratic extension to the Zipf-Mandelbrot Law is fitted to frequency-rank plots for a series of commonly available English language corpora: specifically the Brown, Reuters, Genesis, State of the Union and Movie Reviews corpora available in the Python Natural Language Tool Kit (NLTK) package. In all cases, a quadratic form for the Zipfian frequency-rank relationship is selected as the best approximating model for the data, in preference to the simpler Zipf and Zipf-Mandelbrot Laws. A naive pseudoword set is generated by a bootstrap method to estimate the sampling distributions of the estimated parameters, from which we may conclude that the observed curvature of the frequency-rank plot is not consistent with an unordered random symbol concatenation method as described by Li. A relationship is observed between the lexical diversity of the corpora, as estimated by the Guiraud Index of the relative frequency of types and tokens, and the observed curvature of the frequency-rank plot, although with the number of corpora studied here the relationship is not statistically significant.
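To make the quantities in the abstract concrete, the following is a minimal sketch rather than the author's code. It assumes the quadratic extension takes the form log f = a + b log r + c (log r)^2, where r is rank and f is frequency, so that c measures the curvature of the frequency-rank plot; the exact parameterisation used in the paper may differ. The Guiraud Index is computed as the number of types divided by the square root of the number of tokens. The function names `rank_frequency`, `guiraud_index` and `fit_quadratic_loglog` are hypothetical, and only the Python standard library is used.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 the most frequent type."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def guiraud_index(tokens):
    """Guiraud Index: number of types divided by sqrt(number of tokens)."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def fit_quadratic_loglog(pairs):
    """Least-squares fit of log f = a + b*x + c*x**2, where x = log r.

    Assumed model form (see lead-in). Solves the 3x3 normal equations
    by Gauss-Jordan elimination with partial pivoting.
    """
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    # Power sums S[k] = sum(x^k) and moments T[k] = sum(y * x^k).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations.
    M = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(3):
            if row != col:
                factor = M[row][col] / M[col][col]
                M[row] = [v - factor * p for v, p in zip(M[row], M[col])]
    a, b, c = (M[i][3] / M[i][i] for i in range(3))
    return a, b, c
```

With NLTK installed and a corpus downloaded, the token list could be taken from, for example, `nltk.corpus.brown.words()`, lower-cased and filtered as required, then passed through `rank_frequency` and `fit_quadratic_loglog`. Under the assumed form, c = 0 gives a straight line on log-log axes, which is the plain Zipf Law.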
