Abstract

Performance assessments, in which human raters assess examinee performance on practical tasks, have attracted much attention in various assessment contexts involving the measurement of higher-order abilities. A persistent difficulty, however, is that the accuracy of ability measurement strongly depends on rater and task characteristics such as rater severity and task difficulty. To resolve this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When applying such IRT models to datasets comprising the results of multiple performance tests administered to different examinees, test linking is needed to unify the scale of the model parameters estimated from the individual test results. In test linking, test administrators generally need to design the tests so that raters and tasks partially overlap, and linking accuracy under this design relies heavily on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high linking accuracy remain unclear, making it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common rater and task designs. Concretely, we conduct simulation experiments that examine linking accuracy based on an MFRM while varying the numbers of common raters and tasks along with various other factors that may affect linking accuracy.
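
For reference, one common formulation of the MFRM (following Linacre's many-facet Rasch model; the notation below is ours and may differ from the paper's) gives the log-odds that rater r assigns examinee n category k rather than k − 1 on task i as

$$ \log \frac{P_{nirk}}{P_{nir(k-1)}} = \theta_n - \beta_i - \gamma_r - \tau_k, $$

where θ_n is examinee ability, β_i task difficulty, γ_r rater severity, and τ_k the step difficulty of rating category k. Linking is the problem of placing parameters estimated from separate test administrations onto one such common scale.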

Highlights

  • With the increasing need for measuring higher-order abilities such as logical thinking and problem-solving, performance assessments, in which human raters assess examinee performance on practical tasks, have attracted attention (Rosen & Tager, 2014; Liu, Frankel, & Roohr, 2014; Bernardin, Thomason, Buckley, & Kane, 2016; Abosalem, 2016; Schendel & Tolmie, 2017; Uto & Ueno, 2018)

  • One limitation of performance assessments is that the accuracy of ability measurement strongly depends on rater and task characteristics such as rater severity and task difficulty (Kassim, 2011; Myford & Wolfe, 2003; Eckes, 2005, 2015; Bernardin et al., 2016)

  • These findings suggest that the extent to which IRT-based performance-test linking requires common raters and tasks depends primarily on the following factors: (1) the distributions of examinee ability and the characteristics of raters and tasks, (2) the numbers of examinees, raters, and tasks, and (3) the rates of missing data

Introduction

With the increasing need for measuring higher-order abilities such as logical thinking and problem-solving, performance assessments, in which human raters assess examinee performance on practical tasks, have attracted attention (Rosen & Tager, 2014; Liu, Frankel, & Roohr, 2014; Bernardin, Thomason, Buckley, & Kane, 2016; Abosalem, 2016; Schendel & Tolmie, 2017; Uto & Ueno, 2018). Previous research on traditional IRT-based linking for objective tests has reported that the required extent of commonality depends on the distributions of examinee ability and item characteristics, the numbers of examinees and items, and the accuracy of model parameter estimation (Kilmen & Demirtasli, 2012; Uysal & Ibrahim, 2016; Joo, Lee, & Stark, 2017). The linking method used here is a simple extension of the fixed common item parameters method, a standard method in IRT-based objective-test linking (Arai & Mayekawa, 2011; Jodoin, Keller, & Swaminathan, 2003; Li, Tam, & Tompkins, 2004), because it estimates the parameters of a new test while fixing the parameters of common raters and tasks. In this design, linking accuracy is strongly dependent on the numbers of shared raters and tasks (Linacre, 2014). Note that alternative approaches for evaluating linking accuracy, such as that in Linacre (1998), may be possible if other linking methods are used, such as scale transformation methods with separate calibration or concurrent calibration methods (Kolen & Brennan, 2014; Arai & Mayekawa, 2011; Jodoin et al., 2003; Ryan & Rockmann, 2009), instead of the fixed rater and task parameters method.
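
To make the fixed-parameter idea concrete, below is a minimal Python sketch of linking under a common rater and task design. It is illustrative only: it assumes a simplified dichotomous facets model rather than the paper's polytomous MFRM, and all variable names and simulated values are our own, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated facets data: every examinee is scored by every rater on
# every task. A dichotomous Rasch-style facets model keeps the sketch
# short; a full MFRM would add rating-category threshold parameters.
n_examinees, n_tasks, n_raters = 50, 4, 4
theta = rng.normal(0.0, 1.0, n_examinees)   # examinee ability
beta = rng.normal(0.0, 0.5, n_tasks)        # task difficulty
gamma = rng.normal(0.0, 0.5, n_raters)      # rater severity

def p_correct(th, b, g):
    """P(score = 1) given ability th, task difficulty b, rater severity g."""
    return 1.0 / (1.0 + np.exp(-(th - b - g)))

X = rng.binomial(1, p_correct(theta[:, None, None],
                              beta[None, :, None],
                              gamma[None, None, :]))

# Fixed-parameter linking: tasks 0-1 and raters 0-1 are "common", so
# their parameters are anchored at previously calibrated values (here,
# the true values stand in for an earlier estimate). Only the new
# tasks/raters and the examinee abilities are estimated.
fixed_beta, fixed_gamma = beta[:2], gamma[:2]

def neg_log_lik(free):
    th = free[:n_examinees]
    b = np.concatenate([fixed_beta, free[n_examinees:n_examinees + 2]])
    g = np.concatenate([fixed_gamma, free[n_examinees + 2:]])
    p = p_correct(th[:, None, None], b[None, :, None], g[None, None, :])
    p = np.clip(p, 1e-9, 1.0 - 1e-9)  # guard against log(0)
    return -np.sum(X * np.log(p) + (1 - X) * np.log(1.0 - p))

x0 = np.zeros(n_examinees + 4)
res = minimize(neg_log_lik, x0, method="L-BFGS-B")
print("true new-task difficulties:     ", beta[2:])
print("estimated new-task difficulties:", res.x[n_examinees:n_examinees + 2])
```

The design point worth noting is that no explicit scale transformation is computed: anchoring the common raters' and tasks' parameters during estimation is what places the new test's parameters on the existing scale, so accuracy hinges on how many anchors there are and how well they were originally calibrated.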
