From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language

Gerold Schneider, Gintaré Grigonyté

doi:10.5167/uzh-151298

Abstract

We exploit the information-theoretic measure of surprisal to analyze the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure for the idiom principle, as it expresses reader expectations and text entropy. As strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use, and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test the hypothesis of Pawley and Syder (1983) that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with the uniform information density (UID) of Levy and Jaeger (2007), a principle of minimizing comprehension difficulty. Our goal to abstract away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
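To make the central measure concrete: the surprisal of a word is the negative log probability of that word given its context, so formulaic sequences should receive lower values than unexpected ones. The following is a minimal sketch only, assuming a bigram model with add-one smoothing and a three-sentence toy corpus invented for illustration. The function names, the corpus and the choice of smoothing are assumptions and do not reproduce the models or data used in the paper.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigrams and their left contexts, with a sentence-start marker."""
    contexts = Counter()
    bigrams = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, vocab


def surprisal(prev, word, contexts, bigrams, vocab):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(p)


def mean_surprisal(tokens, contexts, bigrams, vocab):
    """Average per-word surprisal of a token sequence."""
    padded = ["<s>"] + tokens
    values = [
        surprisal(prev, word, contexts, bigrams, vocab)
        for prev, word in zip(padded, padded[1:])
    ]
    return sum(values) / len(values)


# Toy corpus (hypothetical, for illustration only).
corpus = [
    ["as", "a", "matter", "of", "fact"],
    ["as", "a", "result", "of", "this"],
    ["a", "matter", "of", "time"],
]
contexts, bigrams, vocab = train_bigram_counts(corpus)

# A formulaic sequence versus the same words in scrambled order.
formulaic = mean_surprisal(["as", "a", "matter", "of", "fact"], contexts, bigrams, vocab)
scrambled = mean_surprisal(["fact", "matter", "a", "as", "of"], contexts, bigrams, vocab)
print(f"formulaic: {formulaic:.2f} bits/word, scrambled: {scrambled:.2f} bits/word")
```

In this sketch the formulaic sequence receives a lower mean surprisal than the scrambled one, which is the sense in which surprisal can serve as a gradient measure of bundleness rather than a list of individual bundles.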

Full Text