Abstract

Formulaic sequences in language use are often studied through the automatic identification of frequently recurring word sequences, often referred to as 'lexical bundles', in corpora that contrast different registers, academic disciplines, etc. As corpora often differ in size, a critically important assumption in this field is that the use of a normalized frequency threshold, such as 20 occurrences per million words, allows for an accurate comparison of corpora of different sizes. Yet several researchers have argued that normalization may be unreliable when applied to frequency thresholds. This study investigates the issue by comparing the number of lexical bundles identified in corpora that differ only in size. Using two complementary random sampling procedures, subcorpora of 100,000 to two million words were extracted from five corpora, and lexical bundles were identified in them using two normalized frequency thresholds and two dispersion thresholds. The results show that many more lexical bundles are identified in smaller subcorpora than in larger ones. This size effect can be related to the Zipfian nature of the distribution of words and word sequences in corpora. The conclusion discusses several solutions to avoid the unfairness of comparing lexical bundles identified in corpora of different sizes.
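The identification procedure described in the abstract — counting recurring n-grams, converting a per-million-words threshold into a raw count for the corpus at hand, and applying a dispersion threshold — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the naive whitespace tokenization, and the parameter values (4-grams, 20 occurrences per million words, presence in at least two texts) are illustrative assumptions.

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=4, freq_pmw=20.0, min_texts=2):
    """Return n-grams that meet a normalized frequency threshold
    (freq_pmw occurrences per million words) and a dispersion
    threshold (attested in at least min_texts distinct texts).
    Parameter defaults are illustrative, not prescriptive."""
    counts = Counter()
    dispersion = defaultdict(set)
    total_tokens = 0
    for doc_id, text in enumerate(texts):
        tokens = text.lower().split()  # naive tokenization for illustration
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            dispersion[gram].add(doc_id)
    # Convert the per-million-words threshold into a raw count for
    # this corpus: the smaller the corpus, the lower the raw cutoff.
    raw_threshold = freq_pmw * total_tokens / 1_000_000
    return {
        gram: c for gram, c in counts.items()
        if c >= raw_threshold and len(dispersion[gram]) >= min_texts
    }
```

Note how the conversion step exposes the issue the study examines: in a 100,000-word subcorpus, 20 per million words corresponds to a raw cutoff of only 2 occurrences, whereas in a two-million-word corpus it corresponds to 40, so rare sequences clear the bar far more easily in small samples.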
