Abstract

Standardized observation systems seek to reliably measure a specific conceptualization of teaching quality, managing rater error through mechanisms such as certification, calibration, validation, and double-scoring. These mechanisms both support high-quality scoring and generate the empirical evidence used to support the scoring inference (i.e., that scores represent the intended construct). Past efforts to support this inference assume that rater error can be accurately estimated from a few scoring occasions. We empirically test this assumption using two datasets from the Measures of Effective Teaching project. Results show that rater error is highly complex and difficult to measure precisely from a few scoring occasions. Typically designed rater monitoring and control mechanisms likely cannot measure rater error precisely enough to show that raters can distinguish between levels of teaching quality within the range typically observed. We discuss the implications for supporting the scoring inference, including recommended changes to rater monitoring and control mechanisms.
