Abstract

The present study discusses the relevance of measures of lexical diversity (LD) to the assessment of learner corpora. It also argues that existing measures of LD, many of which have become specialized for use with language corpora, are fundamentally measures of lexical repetition, are based on an etic perspective of language, and lack construct validity. The proposed solution draws on Zipf’s (1935) emic perspective of language, which views LD as a matter of perception, but which also assumes that competent speakers of a common language share similar perceptions. The present study tests whether this is true, and specifically whether untrained human raters will show high levels of inter-rater reliability in their judgments of the levels of LD found in 60 texts extracted from a corpus of narratives written in English by a mix of language learners and native speakers. The results confirm Zipf’s assertion, but also indicate that a relatively large number of motivated raters is needed to demonstrate this tendency. The remainder of the study discusses the implications of these results for the development of an automated measure of LD to be used with learner corpora. The proposed method begins with human judgments of a representative subsample of a corpus, proceeds to a statistical model of objective measures that accurately predicts those human judgments, and ends with a multidimensional, corpus-specific automated measure whose output reliably estimates how a group of human judges would rate the levels of LD in the texts of that corpus.
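The calibration pipeline described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: it uses a single objective predictor (type-token ratio, itself one of the repetition-based measures the abstract critiques) regressed onto hypothetical mean ratings from a panel of human judges, whereas the proposed measure is multidimensional and corpus-specific.

```python
# Sketch of the proposed calibration idea (hypothetical data and predictor):
# 1) obtain human LD ratings for a subsample of texts,
# 2) fit a statistical model of objective measures to those ratings,
# 3) use the model to estimate ratings for the rest of the corpus.

def type_token_ratio(text):
    """Ratio of unique words (types) to total words (tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical calibration subsample: texts paired with the mean LD
# rating assigned by a group of human judges (e.g. on a 1-9 scale).
texts = [
    "the cat sat on the mat the cat sat",
    "a quick brown fox jumps over a lazy sleeping dog nearby",
]
ratings = [3.0, 7.5]

xs = [type_token_ratio(t) for t in texts]
a, b = fit_simple_regression(xs, ratings)

# Estimate the rating the panel would give an unseen text.
predicted = a + b * type_token_ratio("every word here is completely different")
```

A full implementation along these lines would replace the single predictor with a set of objective measures and a multivariate model, selected so that its predictions reproduce the reliable human judgments for that specific corpus.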
