This article investigates how measurement models and statistical procedures can be applied to estimate the accuracy of proficiency classification in language testing. The paper starts with a concise introduction to four measurement models: the classical test theory (CTT) model, the dichotomous item response theory (IRT) model, the testlet response theory (TRT) model, and the polytomous item response theory (Poly-IRT) model. Following this, two classification procedures are presented: the Livingston and Lewis method for CTT and the Rudner method for the three IRT-based models. The utility of these models and procedures is then evaluated by examining the accuracy of classifying 5000 test takers from a large-scale language certification examination into two proficiency categories. The most important finding is that the testlet format (multiple questions based on one prompt), which language tests usually rely on, has a great impact on proficiency classification. All testlets in this study show a strong testlet effect. Hence, the TRT model is recommended for proficiency classification. Using the standard IRT model would inflate the estimated classification accuracy because it underestimates measurement error. Meanwhile, using the Poly-IRT model would give slightly less accurate classification results. As for the CTT model, while its classification accuracy is comparable to that of the TRT model, the classification results of the two models are considerably inconsistent with each other.
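The Rudner procedure mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single cut score separating two proficiency categories, and it assumes that each test taker's true ability is normally distributed around the ability estimate with a standard deviation equal to that estimate's standard error. The function names `normal_cdf` and `rudner_accuracy` and all input values are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def rudner_accuracy(theta_hats, std_errors, cut_score):
    """Expected classification accuracy for a two-category decision.

    Assumption: true ability ~ Normal(theta_hat, std_error) for each
    test taker. The accuracy is the average probability that the true
    ability lies on the same side of the cut score as the estimate.
    """
    total = 0.0
    for theta, se in zip(theta_hats, std_errors):
        p_below = normal_cdf(cut_score, theta, se)
        # Test takers at or above the cut are classified as "pass".
        p_correct = 1.0 - p_below if theta >= cut_score else p_below
        total += p_correct
    return total / len(theta_hats)


# Hypothetical ability estimates, evaluated under two sizes of standard error.
thetas = [-1.2, -0.3, 0.4, 1.5]
small_se = rudner_accuracy(thetas, [0.3] * 4, cut_score=0.0)
large_se = rudner_accuracy(thetas, [0.6] * 4, cut_score=0.0)
print(small_se > large_se)
```

With the same ability estimates, smaller standard errors yield a higher expected accuracy. This is the mechanism by which a model that underestimates measurement error, such as the standard IRT model applied to testlet data, can overstate classification accuracy.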