Methodological differences matter: Identification thresholds and corpus composition in lexical bundle research

Fan Pan

doi:10.2989/16073614.2020.1858897

Abstract

In lexical bundle research, it is common practice to extract and compare lexical bundles across different corpora based on certain identification thresholds. Studies of this kind adopt varying frequency and dispersion thresholds because the corpora compared differ in size and/or in the number of texts. However, few studies have considered the consequences of these methodological differences. To address this gap, a series of experiments was conducted to explore the impact of identification thresholds and corpus composition on bundle extraction and on the results of cross-corpora comparison. The first set of experiments demonstrated that different identification thresholds applied to the same pair of corpora may yield conflicting results, indicating that methodological differences could be one source of the mixed results in the literature. Further, after removing the influence of differences in size and/or number of texts, the second set of experiments revealed that increasing the dispersion threshold proportionally to offset differences in the number of texts actually favours the corpus with the smaller number of texts. The study highlights the interactive relationship between frequency thresholds and dispersion thresholds, and the key role of dispersion thresholds in filtering bundles. The article also discusses the methodological implications for future contrastive lexical bundle research.
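
To make the two identification thresholds concrete, the following is a minimal sketch of how a frequency threshold and a dispersion threshold might jointly filter candidate bundles. It is not the procedure used in the study: the function name `extract_bundles`, its parameters, the whitespace tokenisation and the default threshold values are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def extract_bundles(texts, n=4, min_per_million=20.0, min_texts=3):
    """Return n-grams that pass both a frequency and a dispersion threshold.

    Hypothetical helper for illustration only:
    - texts: list of strings, one per corpus text
    - min_per_million: frequency threshold, normalised per million tokens
    - min_texts: dispersion threshold, minimum number of distinct texts
    """
    freq = Counter()                 # raw frequency of each n-gram
    text_range = defaultdict(set)    # ids of the texts each n-gram occurs in
    total_tokens = 0

    for text_id, text in enumerate(texts):
        tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            text_range[gram].add(text_id)

    # Convert the normalised threshold into a raw count for this corpus size.
    min_raw = min_per_million * total_tokens / 1_000_000

    # Keep only n-grams that satisfy both thresholds.
    return {
        " ".join(gram): (count, len(text_range[gram]))
        for gram, count in freq.items()
        if count >= min_raw and len(text_range[gram]) >= min_texts
    }
```

Because the normalised frequency threshold is converted to a raw count that depends on corpus size, while the dispersion threshold depends on the number of texts, the same nominal settings can filter two corpora quite differently. Calling the function on the same corpus with different `min_texts` values gives a rough feel for how strongly the dispersion threshold alone changes the resulting bundle list.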

Full Text