Abstract

Variationist sociolinguistic methodology is grounded in the principle of accountability, which requires researchers to identify all of the contexts in which a given variable occurs or fails to occur. For morphosyntactic, lexical, and discourse variables, this process is notoriously time- and labor-intensive, as researchers manually sift through raw data in search of tokens to analyze. In this article, we demonstrate the utility of pretrained computational language models for automatically identifying tokens of sociolinguistic variables in raw text. We focus on two English-language variables from different linguistic domains: intensifier choice (lexical; e.g., she is {very, really, so} smart) and complementizer selection (morphosyntactic; e.g., they thought {that, Ø} I understood). Text classifiers built with Bidirectional Encoder Representations from Transformers (BERT) achieve high precision and recall for both variables, even with relatively little hand-annotated training data. Our findings suggest that computational language models can dramatically reduce the burden of preparing data for variationist analysis. Furthermore, by inspecting the classifiers' scores for individual sentences, researchers can observe patterns that should be written into the description of the variable context for further study.
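To make the token-identification task concrete, the sketch below shows a candidate-extraction step of the kind that typically precedes classifier scoring: surface-matching potential intensifier tokens in raw text. The function name, the intensifier list, and the regex heuristics are illustrative assumptions, not the paper's implementation; in the workflow the abstract describes, each candidate would then be scored by a fine-tuned BERT classifier to decide whether it is a true token of the variable (for instance, "so" in "so that" is not an intensifier).

```python
import re

# Illustrative assumption: a partial list of intensifier variants.
INTENSIFIERS = {"very", "really", "so"}

def find_intensifier_candidates(text):
    """Return (sentence, intensifier) pairs for each surface match of a
    candidate intensifier followed by a word. These are raw candidates
    only; a trained classifier would filter out false positives."""
    candidates = []
    # Naive sentence split on ., !, or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in re.finditer(r"\b(very|really|so)\b\s+\w+", sentence, re.I):
            candidates.append((sentence, match.group(1).lower()))
    return candidates
```

A pre-filter like this keeps recall high at the cost of precision, which is exactly the gap the BERT classifiers in the article are reported to close.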