Abstract. We present a corpus-based method, Variation-Based Distance and Similarity Modeling (VADIS), that calculates distances between registers as a function of how much the probabilistic conditioning of variation differs across them. When language users can choose between different ways of saying similar things (e.g., "cut off the tops" versus "cut the tops off"), to what extent are these choices regulated differently in different registers? To address this question, we re-analyze pre-existing datasets covering the genitive, dative, and particle placement alternations in the grammar of English. These datasets span five broad register categories: spoken informal English, spoken formal English, written informal English, written formal English, and online/web-based English. Analysis shows that (a) the registers under analysis are relatively, but not entirely, homogeneous in terms of the probabilistic grammars conditioning grammatical choices, and (b) more often than not we see a split between spoken and written registers.