Scaled measurement of geographic and social speech data

W A Kretzschmar, B A Kretzschmar, I M Brockman

doi:10.1093/llc/fqs058

Abstract

One of the principal signs that speech is a complex system is the nonlinear arrangement of frequencies of variants in linguistic survey data. When the counts are charted by frequency, they form an asymptotic hyperbolic curve (A-curve) at every scale of analysis. The shape of the curve is sensitive to sample size: a small sample is unlikely to show an A-curve. It is also sensitive to categorization: too many categories make the data appear linear because each category contains only a few tokens, while too few categories, such as the two data points from binary categories, also yield a line, not a curve. The A-curve can be observed only when the number of categories into which the data are sorted lies between these two extremes. Common practice in dialectology and sociolinguistics has been to establish a small number of possible categories, such as phonemes for pronunciation, or to notice only the few most frequently occurring variants and ignore the rest. Such methods cannot address the underlying complexity of the data. In this essay, we discuss the Gini coefficient, used in economics, as a means to measure optimal nonlinearity. In an experiment in which pronunciation data from survey research on the American English vowel system are analyzed in various subsamples, we demonstrate that A-curves exist in the data in all cases, and we establish parameters for the interaction of sample size and number of categories in the design of valid and reliable experiments.
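As an illustration of how a Gini coefficient might be applied to variant counts, the following is a minimal sketch that assumes the standard discrete formula from economics. The function name `gini` and all sample counts are hypothetical and are not taken from the survey data; the abstract does not specify the exact computation the authors use.

```python
def gini(counts):
    """Gini coefficient of non-negative category counts (0 = perfectly even).

    Assumes the standard discrete formula over values sorted ascending:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,  with i = 1..n
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one category with a nonzero count")
    # Rank-weighted sum: larger counts receive larger ranks.
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical token counts per variant (illustrative only).
skewed = [100, 40, 20, 10, 5, 3, 2, 1, 1, 1]   # a few frequent variants, many rare ones
binary = [60, 40]                              # two categories only
even = [5, 5, 5, 5]                            # every variant equally frequent

print(gini(skewed), gini(binary), gini(even))
```

In this sketch, a steeply skewed distribution of counts gives a value well above that of the two-category or evenly distributed cases, which is the sense in which the coefficient can serve as a rough index of nonlinearity.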

Full Text