Abstract
We propose a novel approach to modelling rater effects in scoring-based assessment. The approach is based on a Bayesian hierarchical model and simulations from the posterior distribution. We apply it to large-scale essay assessment data spanning a period of five years. Empirical results suggest that the model provides a good fit both for total scores and for individual rubrics. We estimate the median impact of rater effects on the final grade to be ±2 points on a 50-point scale, while 10% of essays would receive a score that differs from their actual quality by at least ±5 points. Most of the impact is due to rater unreliability, not rater bias.
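The abstract does not reproduce the model specification, but a minimal sketch of what a Bayesian hierarchical rater-effects model with posterior simulation can look like is given below, assuming PyMC and synthetic long-format data with one row per rater-essay scoring event. All variable names, priors, and dimensions here are illustrative assumptions, not the authors' specification.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_essays, n_raters, n_obs = 200, 30, 600

# Synthetic long-format data: each row is one rater scoring one essay.
essay_idx = rng.integers(0, n_essays, n_obs)
rater_idx = rng.integers(0, n_raters, n_obs)
true_quality = rng.normal(25, 5, n_essays)   # latent essay quality (50-point scale)
rater_bias = rng.normal(0, 1.5, n_raters)    # systematic strictness/leniency
scores = (true_quality[essay_idx] + rater_bias[rater_idx]
          + rng.normal(0, 2.0, n_obs))       # non-systematic rater error

with pm.Model():
    # Hierarchical structure: essay qualities and rater biases are drawn from
    # population-level distributions whose spreads are themselves estimated.
    sigma_q = pm.HalfNormal("sigma_q", 10.0)
    sigma_b = pm.HalfNormal("sigma_b", 5.0)   # spread of rater bias
    sigma_e = pm.HalfNormal("sigma_e", 5.0)   # rater unreliability
    quality = pm.Normal("quality", mu=25.0, sigma=sigma_q, shape=n_essays)
    bias = pm.Normal("bias", mu=0.0, sigma=sigma_b, shape=n_raters)
    pm.Normal("obs", mu=quality[essay_idx] + bias[rater_idx],
              sigma=sigma_e, observed=scores)
    idata = pm.sample(1000, tune=1000, target_accept=0.9, random_seed=0)

# Posterior simulation of rater impact: observed score minus inferred quality,
# summarised across posterior draws.
q_draws = idata.posterior["quality"].stack(sample=("chain", "draw")).values
impact = scores[:, None] - q_draws[essay_idx, :]
print("median |impact| over posterior draws:", np.median(np.abs(impact)))
```

In a sketch like this, the last step mirrors the paper's headline quantities: the distribution of the difference between the score an essay received and its inferred quality yields a median impact and tail percentages.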
Highlights
Performance assessment plays a fundamental role in society, especially in education, where it is common practice to test students and base their assessment on test scores.
Decisions that follow from performance assessment often have profound consequences for those involved, such as allocation of funding, promotion, and, as is the case with the essay scoring data used in this study, enrolment into higher education programs.
The proposed model provides a good fit for the total score and, when extended, for individual rubrics, despite substantial semantic differences between rubrics and, consequently, in their scales and distributions.
Summary
Performance assessment plays a fundamental role in society, especially in education, where it is common practice to test students and base their assessment on test scores. Decisions that follow from performance assessment often have profound consequences for those involved, such as allocation of funding, promotion, and, as is the case with the essay scoring data used in this study, enrolment into higher education programs. Some skills can be assessed with relatively structured tasks, but writing skills and language proficiency are typically assessed with essays, unstructured essay-type questions, or other types of free-response tasks. Scoring these types of performance assessments relies heavily on human judgement. Scores can be affected by factors unrelated to the ability being assessed, such as bias (strictness, leniency) and (un)reliability (non-systematic error) of the rater.
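To make the bias/unreliability distinction concrete, here is a toy simulation, purely illustrative and not drawn from the paper, in which a single rater is both lenient (systematic bias) and noisy (non-systematic error):

```python
# Toy NumPy illustration (an assumption for exposition, not from the paper):
# bias is systematic error that repeated scoring by the same rater cannot
# average away, while unreliability is non-systematic error that shrinks
# as ratings are averaged.
import numpy as np

rng = np.random.default_rng(1)
true_quality = 30.0                 # latent quality on a 50-point scale
bias, noise_sd = 2.0, 3.0           # a lenient but noisy rater
ratings = true_quality + bias + rng.normal(0.0, noise_sd, 100)

print(f"error of a single rating: {ratings[0] - true_quality:+.2f}")
print(f"error of the mean of 100: {ratings.mean() - true_quality:+.2f}")
# The averaged error converges to the bias (+2.0): unreliability is
# reducible by repeated measurement, bias is not.
```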