Abstract

Multistage test (MST) designs promise efficient student ability estimates, an indispensable asset for individual diagnostics in high-stakes educational assessments. In high-stakes testing, annually changing test forms are required because publicly known test items impair accurate student ability estimation, and items with poor model fit must be continually replaced to guarantee test quality. This requires a large and continually refreshed item pool as the basis for high-stakes MST. In practice, calibrating newly developed items to feed annually changing test forms is highly resource-intensive. Piloting based on a representative sample of students is often not feasible, given that, for schools, participation in the actual high-stakes assessments already requires considerable organizational effort. Hence, under practical constraints, newly developed items may be calibrated on the fly, in the form of a concurrent calibration within the MST design. Based on a simulation approach, this paper examines how well Rasch and 2PL models retrieve item parameters when items are, for practical reasons, placed non-optimally in multistage tests. Overall, the results suggest that the 2PL model performs worse than the Rasch model in retrieving item parameters under non-optimal item assembly in the MST, especially for parameters at the margins. The greater flexibility of 2PL modeling, in which item discrimination is allowed to vary, seems to come at the cost of increased volatility in parameter estimation. Although the overall bias may be modest, individual items can be affected by severe bias when a 2PL model is used for item calibration under non-optimal item placement.
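
For reference, the standard item response functions of the two models being compared are given below (textbook formulations; the exact parameterization used in the paper is not shown in this excerpt):

Rasch: P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

2PL: P(X_{ij} = 1 \mid \theta_j) = \frac{\exp\bigl(a_i(\theta_j - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_j - b_i)\bigr)}

Here \theta_j denotes the ability of student j, b_i the difficulty of item i, and a_i the item discrimination, which is fixed at a common value under the Rasch model and freely estimated under the 2PL model. The additional a_i parameters are what give the 2PL model its flexibility and, per the results summarized above, what appears to drive the increased volatility in calibration.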

Highlights

  • Multistage test (MST) designs promise efficient student ability estimates by adaptive testing (Hendrickson, 2007; Yan et al., 2014)

  • MST designs have been widely used in practice because they allow for efficient student ability estimates (e.g., Yan et al., 2014) while not requiring the huge item pool a computer adaptive test (CAT; e.g., van der Linden and Glas, 2010) needs, thereby meeting resource constraints in practice (Berger et al., 2019)

  • To ensure test quality in practice, MST forms must be changed regularly because publicly known test items impair accurate student ability estimation, and items with poor model fit need to be continually replaced by newly developed items



Introduction

Multistage test (MST) designs promise efficient student ability estimates by adaptive testing (Hendrickson, 2007; Yan et al., 2014). MST designs consist of several parts (i.e., stages), which, in turn, include multiple item sets of varying difficulty, called modules (Zenisky et al., 2010; Yan et al., 2014). Students are routed, based on their performance (i.e., preliminary ability estimates or the number of correctly answered items), to item sets whose difficulty matches the range of their abilities. Compared to a linear test, this procedure allows student abilities to be estimated more precisely (e.g., Yan et al., 2014) and prevents students from becoming discouraged; it assesses their skills based on items which
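
The routing principle can be illustrated with a minimal sketch in Python (the two-stage structure, number-correct routing rule, and cut scores below are hypothetical choices for illustration, not the design simulated in the paper):

def route_to_stage2_module(num_correct_stage1, num_items_stage1):
    """Assign a student to a second-stage module based on stage-1 performance."""
    proportion_correct = num_correct_stage1 / num_items_stage1
    if proportion_correct < 0.4:   # hypothetical cut score
        return "easy"
    if proportion_correct < 0.7:   # hypothetical cut score
        return "medium"
    return "hard"

# Example: a student solving 9 of 12 routing items is sent to the hard module.
print(route_to_stage2_module(9, 12))

Operational MSTs may instead route on preliminary ability estimates and may use more than two stages; the number-correct rule above is only the simplest variant mentioned in the introduction.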

