On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally

Stefan Th. Gries

doi:10.1075/jsls.00005.gri

Abstract

This paper critically discusses how corpus linguistics in general, and learner corpus research in particular, has been dealing with frequency data of all sorts, and with over- and underuse frequencies in particular. On the basis of learner corpus data, I demonstrate the pitfalls of using aggregate data and of the lack of statistical control that unfortunately characterize much work. In fact, I will demonstrate that monofactorial methods have very little to offer at all to research on observational data. While this paper is admittedly very didactic and methodological, I think the discussion of the empirical data offered here, a reanalysis of previously published work, shows how misleading many studies potentially are and provides far-reaching implications for much of corpus linguistics and learner corpus research. Ideally/maximally, this paper together with Paquot & Plonsky (2017, International Journal of Learner Corpus Research) would lead to a complete revision of how learner corpus linguists use quantitative methods and study over-/underuse; minimally, this paper would stimulate a much-needed discussion of currently lacking methodological sophistication.

Full Text