Abstract

Rating essays is a complex task in which students' grades can be adversely affected by test-irrelevant factors such as rater characteristics and rating scales. Understanding these factors and controlling their effects are crucial for test validity. Rater behaviour has been studied extensively through qualitative methods such as questionnaires and think-aloud protocols, and quantitatively through the multi-faceted Rasch model (MFRM) [Congdon, P.J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163–178; Engelhard, G. (1992). The measurement of writing ability with a multi-faceted Rasch model. Applied Measurement in Education, 5(3), 171–191; Lumley, T., & McNamara, T.F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(54), 54–71; Weigle, S.C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287]. While these studies have yielded a rich understanding of rater characteristics and rating, less is known about how quantitative analysis can help manage, and make adjustments for, differences in students' scores. This study uses the MFRM [Linacre, J.M. (1989). Multi-faceted Rasch measurement. Chicago: MESA Press] to investigate raters' scoring behaviour and to ascertain how it affects students' scores in a large-scale placement test. It proposes using the anchoring method within the MFRM to manage the placement of students when it is not possible to have all raters score all scripts. The analysis shows that the raters, while mostly internally consistent, differ in severity despite training. These differences would significantly affect a student's placement if no measures were taken to manage the problem. The MFRM also shows that a few raters may have been scoring the essays more holistically over time, probably owing to the halo effect [Engelhard, G. (1998). Evaluating the quality of ratings obtained from standard-setting judges. Educational and Psychological Measurement, 58(2), 179–196]. The study demonstrates how the MFRM can reveal patterns in raters' scoring and, most importantly, how the analysis yields data that allow targeted strategies for moderating scores to manage rater differences.
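The moderation problem the abstract describes can be illustrated with a toy simulation of the anchoring idea: a shared set of anchor scripts scored by all raters provides a common frame of reference, so each rater's severity can be estimated and their scores adjusted before placement decisions. Everything below is an illustrative assumption rather than the study's actual procedure: the 1–6 scale, the severity values, and the simple mean-offset adjustment stand in for a full MFRM analysis (e.g. in FACETS), which estimates severities on the logit scale.

```python
import random
import statistics

random.seed(42)

SCALE_MIN, SCALE_MAX = 1, 6  # hypothetical band scale

def rate(ability, severity, noise=0.3):
    """Observed score = script quality minus rater severity plus noise,
    clamped to the rating scale (a crude stand-in for an MFRM facet model)."""
    raw = ability - severity + random.gauss(0, noise)
    return float(min(SCALE_MAX, max(SCALE_MIN, round(raw))))

# Latent quality of 20 anchor scripts scored by both raters.
anchors = [random.uniform(2, 6) for _ in range(20)]

severity_a, severity_b = 0.0, 0.8  # assumed: rater B is harsher by 0.8 bands

scores_a = [rate(x, severity_a) for x in anchors]
scores_b = [rate(x, severity_b) for x in anchors]

# Estimate rater B's severity relative to rater A from the shared anchors.
offset = statistics.mean(scores_a) - statistics.mean(scores_b)

# Moderate B's scores toward A's frame of reference; in practice this
# adjustment happens on the latent measure before placement cut-offs apply.
adjusted_b = [s + offset for s in scores_b]

gap_before = abs(statistics.mean(scores_a) - statistics.mean(scores_b))
gap_after = abs(statistics.mean(scores_a) - statistics.mean(adjusted_b))
print(f"mean gap before adjustment: {gap_before:.2f}")
print(f"mean gap after adjustment:  {gap_after:.2f}")
```

The sketch only equates rater means; an MFRM analysis additionally checks each rater's internal consistency (fit) and separates script, rater, and scale-category effects, which is what allows the targeted moderation strategies the abstract refers to.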
