Interrater Reliability Estimators Commonly Used in Scoring Language Assessments: A Monte Carlo Investigation of Estimator Accuracy

Grant B Morgan, Min Zhu, Robert L Johnson, Kari J Hodge

doi:10.1080/15434303.2014.937486

Abstract

Common estimators of interrater reliability include Pearson product-moment correlation coefficients, Spearman rank-order correlations, and the generalizability coefficient. The purpose of this study was to examine the accuracy of estimators of interrater reliability when varying the true reliability, number of scale categories, and number of essays rated. This research used Monte Carlo methods to draw samples from known population models to examine the accuracy of selected estimators of interrater reliability between two raters. In addition to the estimators listed above, we included the polychoric correlation coefficient because it aligns with the context in which student language assessments are rated. Although each estimator produced an estimate close to the population parameter, polychoric correlations provided the closest estimates, with mean and median bias equal to 0.00 (SD = 0.05) across all conditions. The use of Pearson product-moment and Spearman rank-order correlation coefficients might result in the underestimation of interrater reliability by as much as a third.
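
The general shape of such a Monte Carlo design can be illustrated with a minimal sketch. It is not the authors' code, and several details are assumptions made only for illustration: the two raters' latent scores are taken to be bivariate normal with correlation equal to the true reliability, the latent scores are cut into ordinal categories at equal-probability thresholds, and the function names `average_ranks` and `simulate_bias` are hypothetical. Only the Pearson and Spearman estimators are computed; the polychoric and generalizability coefficients are left out to keep the sketch short.

```python
import numpy as np
from statistics import NormalDist


def average_ranks(x):
    """Rank the values in x, giving tied values the mean of their ranks."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)       # last 1-based rank held by each distinct value
    first = last - counts + 1      # first 1-based rank held by each distinct value
    return ((first + last) / 2.0)[inverse]


def simulate_bias(true_rho, n_categories, n_essays, n_reps=500, seed=0):
    """Return (Pearson bias, Spearman bias) averaged over replications.

    Assumed population model: bivariate normal latent scores for two
    raters, discretized at equal-probability cut points.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, true_rho], [true_rho, 1.0]]
    # Assumed thresholds: standard normal quantiles at k / n_categories.
    cuts = [NormalDist().inv_cdf(k / n_categories) for k in range(1, n_categories)]

    pearson, spearman = [], []
    for _ in range(n_reps):
        latent = rng.multivariate_normal([0.0, 0.0], cov, size=n_essays)
        scores = np.digitize(latent, cuts)  # ordinal scores 0 .. n_categories - 1
        a, b = scores[:, 0], scores[:, 1]
        if a.std() == 0 or b.std() == 0:
            continue  # a correlation is undefined when a rater gives one score only
        pearson.append(np.corrcoef(a, b)[0, 1])
        # Spearman is the Pearson correlation of the tie-averaged ranks.
        spearman.append(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

    return float(np.mean(pearson)) - true_rho, float(np.mean(spearman)) - true_rho


if __name__ == "__main__":
    # Illustrative condition: true reliability .80, 3 categories, 200 essays.
    p_bias, s_bias = simulate_bias(true_rho=0.8, n_categories=3, n_essays=200)
    print(f"Pearson bias:  {p_bias:+.3f}")
    print(f"Spearman bias: {s_bias:+.3f}")
```

Under these assumptions, a negative bias indicates that the estimator computed on the categorized scores falls below the latent correlation used to generate them. Varying `true_rho`, `n_categories`, and `n_essays` corresponds to the three factors manipulated in the study.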

Full Text