Abstract

This article presents the Twitter Corpus of English in Hong Kong (TCOEHK): a 123-million-word corpus derived from sampling tweets across the 18 districts and three geographical (macro-)regions of Hong Kong from 2010 to 2022. It introduces the corpus and demonstrates its utility by examining four linguistic variables found in English in Hong Kong (EngHK) and its dominant variety, Hong Kong English (HKE): tense marking, ‘-ize/-ise’ suffix use, adverb syntactic position, and copula (non-)use. It explores how these variables relate to intralinguistic, stylistic (e.g. formality), and extralinguistic factors (e.g. region, year, affect). The findings show that the distribution of variants for all four variables (e.g. rates of -ize use) is similar to the patterns identified in prior HKE work. Beyond confirming previous research, the results reveal that intralinguistic, stylistic, and extralinguistic factors can each influence the distribution of variants differently depending on the variable studied, highlighting the complex and ever-changing nature of EngHK. The availability of social metadata and the large size of the TCOEHK make it viable for examining (socio)linguistic variation and change in contemporary (Twitter-style) EngHK, as well as potential regional and social sub-varieties/styles within EngHK. The corpus thus promises to advance research on variation and change in EngHK.

