Sociolinguistic Analysis with Missing Metadata? Leveraging Linguistic and Semiotic Resources Through Deep Learning to Investigate English Variation and Change on Twitter

Wilkinson Daniel Wong Gonzales

doi:10.1093/applin/amad086

Abstract

This paper highlights a language and sign-based computational solution to the problem of missing social metadata on Twitter (now, ‘X’): demographic prediction using Deep Learning. It aims to apply this method to variationist sociolinguistics research, illustrating how the approach can facilitate analyses with missing metadata (i.e. stylistic age and sex/gender) by deriving this metadata solely from publicly available linguistic and semiotic resources on Twitter profiles (e.g. display pictures and biographies). I use my investigations of English tweets from the Philippines and Hong Kong as case examples, examining the extent to which the use of the copula and the use of will-shall modals on social media are conditioned by diachronic factors as well as factors internal and external to language (e.g. social factors). The results reveal the influence of stylistic gender and age as well as other factors on patterns of variation. They offer a glimpse into the nuanced sociolinguistic aspects of language usage on social media, highlighting the advantages of utilizing AI-powered Deep Learning to tackle data-related challenges. The discoveries and methodology hold the possibility of influencing other fields and practical situations beyond the study of language and society.

Full Text