With the increasing availability of digital text, there has been an explosion of computational methods designed to turn patterns of word co-occurrence in large text corpora into numerical scores expressing the “semantic distance” between any two words. The success of such methods is typically evaluated by how well they predict human judgments of similarity. Here, I examine how well corpus-based methods predict the amplitude of the N400 component of the event-related potential (ERP), an online measure of lexical processing in brain electrical activity. ERPs elicited by the second words of 303 word pairs were analyzed at the level of individual items. Three corpus-based measures (mutual information, distributional similarity, and latent semantic analysis) were compared to a traditional measure of free association strength. In a regression analysis, corpus-based and free association measures each explained some of the variance in N400 amplitude, suggesting that the two kinds of measure may tap distinct aspects of word relationships. The lexical factors of concreteness of word meaning, word frequency, number of semantic associates, and orthographic similarity also explained variance in N400 amplitude at the single-item level.