Content matters: Measures of contextual diversity must consider semantic content

Brendan T Johns, Michael N Jones

doi:10.1016/j.jml.2021.104313

Abstract

Measures of contextual diversity seek to replace word frequency by counting the number of different contexts that a word occurs in rather than the total raw number of occurrences (Adelman, Brown, & Quesada, 2006). It has repeatedly been shown that contextual diversity measures outperform word frequency on word recognition datasets (Adelman & Brown, 2008; Brysbaert & New, 2009). Recently, Hollis (2020) demonstrated that the standard operationalization of contextual diversity as a document count accounts for relatively little unique variance over word frequency when other variables of contextual occurrences are controlled for. One aspect that the analysis conducted by Hollis (2020) did not take into account was the semantic content of the contexts in which words occur. Johns, Dye, and Jones (2020) and Johns (2021) have recently shown that defining linguistic contexts at larger, and more ecologically valid, levels leads to contextual diversity measures that provide very large improvements over word frequency, especially when implemented with principles from the Semantic Distinctiveness Model of Jones, Johns, and Recchia (2012). Across a series of simulations, we demonstrate that the advantages of contextual diversity measures depend upon the use of semantic representations of words to determine the uniqueness of contextual occurrences, where unique contextual occurrences have a greater impact on a word’s lexical strength than redundant contextual occurrences. The results of the simulations suggest that for better theoretical accounts of lexical strength to be developed, attention needs to be paid to the representation of linguistic contexts. Code and data associated with this article are available at https://osf.io/r5ec2/.
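
To make the contrast between the two basic counts concrete, the following is a minimal sketch, not the authors' released code (which is at the OSF link above). It assumes whitespace tokenization and treats each string as one context (document); the function names and the toy corpus are hypothetical and chosen only for illustration. It covers only word frequency and the document-count operationalization of contextual diversity, not the semantic measures discussed in the article.

```python
from collections import Counter


def word_frequency(documents):
    """Total number of occurrences of each word across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def contextual_diversity(documents):
    """Number of distinct documents in which each word occurs (document count)."""
    counts = Counter()
    for doc in documents:
        # A set keeps each word at most once per document,
        # so repeats inside one context are not counted again.
        counts.update(set(doc.lower().split()))
    return counts


# Hypothetical toy corpus: each string is treated as one context.
toy_corpus = [
    "the dog chased the dog next door",
    "the cat slept",
    "a dog barked at the cat",
]

wf = word_frequency(toy_corpus)
cd = contextual_diversity(toy_corpus)

for word in ("dog", "cat", "the"):
    print(word, "frequency:", wf[word], "document count:", cd[word])
```

In this toy corpus, "dog" occurs three times but in only two contexts, so the two measures diverge whenever a word repeats within a context.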

Full Text