The possibility of improving automated calculation of measures of lexical richness for EFL writing: A comparison of the LCA, NLTK and SpaCy tools

Ryan Spring, Matthew Johnson

doi:10.1016/j.system.2022.102770

Abstract

Automatically calculating measures of lexical richness is important for L2 learning because such measures can be used to assess productive abilities and general linguistic ability. One popular tool for doing so is the Lexical Complexity Analyzer (LCA), but more advanced parsing tools have become available since its creation. This paper compares a modified version of the LCA code run with NLTK and SpaCy, two popular natural language processing toolkits, against the online version of the LCA in calculating 26 measures of lexical richness. We show how similarly the three tools calculate the measures and how well each tool's calculations correlate with EFL writers' human-rated scores and TOEFL® ITP scores. We found that six of the measures suggested by Lu (2012) to be associated with higher oral proficiency were also highly correlated with higher human-rated scores and TOEFL® ITP scores in our data set. However, the modifications to our code, which use a different list to determine word sophistication and allow be and have verbs to be treated as lexical verbs, caused four measures that Lu (2012) found to be unassociated with proficiency to correlate with both human-rated scores and TOEFL® ITP scores, particularly when run with SpaCy.
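To make concrete what a measure of lexical richness looks like, the following is a minimal sketch of three type-token-based variation measures (type-token ratio, root TTR, and corrected TTR). It is an illustration only, not the code evaluated in this paper: the regular-expression tokenizer and the function names `tokenize` and `lexical_variation` are assumptions introduced here, standing in for the tokenization and lemmatization that NLTK or SpaCy would provide, and no part-of-speech tagging or word sophistication list is used.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and extract word tokens.

    Assumption: a simple regex stands in for a real tokenizer/lemmatizer
    such as those provided by NLTK or SpaCy.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_variation(text):
    """Return token/type counts and three TTR-based variation measures."""
    tokens = tokenize(text)
    n = len(tokens)          # number of word tokens
    if n == 0:
        raise ValueError("text contains no word tokens")
    t = len(set(tokens))     # number of distinct word types
    return {
        "tokens": n,
        "types": t,
        "ttr": t / n,                          # type-token ratio
        "root_ttr": t / math.sqrt(n),          # root TTR
        "corrected_ttr": t / math.sqrt(2 * n), # corrected TTR
    }


sample = "The cat sat on the mat and the dog sat on the rug"
result = lexical_variation(sample)
print(result)
```

Because this sketch counts surface forms rather than lemmas, and does not separate lexical words from function words, its values would differ from those of a tool that lemmatizes and tags parts of speech before counting.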

Full Text