Validating word lists that represent learner knowledge in EFL contexts: The impact of the definition of word and the choice of source corpora

Geoffrey G Pinchbeck, Dale Brown, Stuart McLean, Brandon Kramer

doi:10.1016/j.system.2022.102771

Abstract

While word-frequency lists have been commonly used as indexes of word usefulness, their role as a proxy for learner word knowledge is unclear. Word knowledge in a structured sample (N = 625) of Japanese university-level EFL learners, operationalized using dichotomous Rasch modeling of test-item data, was used as an external reference criterion to investigate two issues germane to the development of word lists representing learner knowledge in EFL contexts: 1) the definition of word and 2) the choice of reference corpus. On the former, corpus-derived word-frequency lists based on orthographic word forms, flemmas, or word families were generated from 18 different corpora. Word-frequency lists using flemma-based word groupings resulted in higher correlations with learner population word knowledge than those using word-family-based groupings across all 18 sets of word lists tested. On the latter, lists derived from corpora of spontaneous speech, fictional TV/movies for younger viewers, and narrative written texts consistently showed higher correlations with word knowledge than those derived from non-conversational speech or any non-fiction written text genre. These results suggest that mega-corpora compiled from conveniently available electronic written texts may not be ideal as scales for diagnostic vocabulary testing or as indexes used in readability formulae.

Full Text