Abstract

Formulaic sequences in language use are often studied through the automatic identification of frequently recurring word sequences, often referred to as 'lexical bundles', in corpora that contrast different registers, academic disciplines, etc. As corpora often differ in size, a critically important assumption in this field is that the use of a normalized frequency threshold, such as 20 occurrences per million words, allows for an accurate comparison of corpora of different sizes. Yet several researchers have argued that normalization may be unreliable when applied to frequency thresholds. This study investigates the issue by comparing the number of lexical bundles identified in corpora that differ only in size. Using two complementary random sampling procedures, subcorpora of 100,000 to two million words were extracted from five corpora, and lexical bundles were identified in them using two normalized frequency thresholds and two dispersion thresholds. The results show that many more lexical bundles are identified in smaller subcorpora than in larger ones. This size effect can be related to the Zipfian nature of the distribution of words and word sequences in corpora. The conclusion discusses several solutions to avoid the unfairness of comparing lexical bundles identified in corpora of different sizes.
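The identification procedure described in the abstract — counting recurring n-grams, converting a per-million-words threshold into a raw count for the corpus at hand, and applying a dispersion threshold — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the naive whitespace tokenization, and the parameter values (4-grams, 20 occurrences per million words, presence in at least two texts) are illustrative assumptions.

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=4, freq_pmw=20.0, min_texts=2):
    """Return n-grams that meet a normalized frequency threshold
    (freq_pmw occurrences per million words) and a dispersion
    threshold (attested in at least min_texts distinct texts).
    Parameter defaults are illustrative, not prescriptive."""
    counts = Counter()
    dispersion = defaultdict(set)
    total_tokens = 0
    for doc_id, text in enumerate(texts):
        tokens = text.lower().split()  # naive tokenization for illustration
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            dispersion[gram].add(doc_id)
    # Convert the per-million-words threshold into a raw count for
    # this corpus: the smaller the corpus, the lower the raw cutoff.
    raw_threshold = freq_pmw * total_tokens / 1_000_000
    return {
        gram: c for gram, c in counts.items()
        if c >= raw_threshold and len(dispersion[gram]) >= min_texts
    }
```

Note how the conversion step exposes the issue the study examines: in a 100,000-word subcorpus, 20 per million words corresponds to a raw cutoff of only 2 occurrences, whereas in a two-million-word corpus it corresponds to 40, so rare sequences clear the bar far more easily in small samples.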
