Abstract

In the context of item response theory (IRT), linking the scales of two measurement points is a prerequisite for examining a change in competence over time. In educational large-scale assessments, non-identical test forms sharing a number of anchor items are frequently scaled and linked using two- or three-parameter item response models. However, if item pools are limited and/or sample sizes are small to medium, the sparser Rasch model is a suitable alternative with respect to the precision of parameter estimation. As the Rasch model implies stricter assumptions about the response process, a violation of these assumptions may manifest as model misfit in the form of item discrimination parameters that empirically deviate from their fixed value of one. The present simulation study investigated the performance of four IRT linking methods—fixed parameter calibration, mean/mean linking, weighted mean/mean linking, and concurrent calibration—applied to Rasch-scaled data with a small item pool. Moreover, the number of anchor items required in the absence or presence of moderate model misfit was investigated for small to medium sample sizes. Effects on the link outcome were operationalized as bias, relative bias, and root mean square error of the estimated sample mean and variance of the latent variable. In this limited context, concurrent calibration had substantial convergence issues, while the other methods resulted in overall satisfactory and similar parameter recovery—even in the presence of moderate model misfit. Our findings suggest that in the case of model misfit, the share of anchor items should exceed the 20% currently proposed in the literature. Future studies should further investigate the effects of anchor item composition with regard to unbalanced model misfit.
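For orientation, the display below sketches the item response models and outcome measures named in the abstract. The notation (θ_p for person ability, b_i for item difficulty, a_i for item discrimination, R for the number of simulation replications, and τ for a target quantity such as the latent mean or variance) is introduced here purely for illustration and is not taken from the article.

\[
P(X_{pi}=1 \mid \theta_p)=\frac{\exp\{a_i(\theta_p-b_i)\}}{1+\exp\{a_i(\theta_p-b_i)\}},
\qquad a_i \equiv 1 \ \text{(Rasch)}, \quad a_i \ \text{estimated freely (2PL)}
\]

\[
\mathrm{Bias}(\hat\tau)=\frac{1}{R}\sum_{r=1}^{R}\bigl(\hat\tau_r-\tau\bigr), \qquad
\mathrm{RelBias}(\hat\tau)=\frac{\mathrm{Bias}(\hat\tau)}{\tau}, \qquad
\mathrm{RMSE}(\hat\tau)=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\bigl(\hat\tau_r-\tau\bigr)^2}
\]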

Highlights

  • Investigating differences between groups that were administered non-identical test forms in an item response theory (IRT) framework requires aligning two test forms onto a common scale, which is known as linking (Kolen and Brennan, 2014)

  • We systematically investigate the linking of Rasch-scaled data based on limited item pools and small to medium sample sizes

  • Because these findings called into question the applicability of concurrent calibration in settings with small absolute numbers of anchor items, it was excluded from further analyses



Introduction

Investigating differences between groups that were administered non-identical test forms in an item response theory (IRT) framework requires aligning two (or more) test forms onto a common scale, which is known as linking (Kolen and Brennan, 2014). Linking is a common procedure in large-scale assessments (LSA) in educational measurement such as the Programme for International Student Assessment (PISA) or the American National Assessment of Educational Progress (NAEP), which are characterized by large item pools and sample sizes. In contexts characterized by a limited pool of items and small to medium sample sizes (as is often the case in studies with limited economic resources or in longitudinal designs), the sparser Rasch (1960) model is a suitable alternative. The linking of Rasch-scaled data in this specific context has rarely been researched.
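As a rough illustration of what linking involves under the Rasch model, the short Python sketch below places the item difficulties and person estimates of a new test form onto the scale of a reference form via mean/mean linking on shared anchor items. The numeric values and array names are hypothetical and only demonstrate the mechanics; the article does not prescribe this particular implementation.

```python
import numpy as np

# Hypothetical Rasch item difficulties of the anchor items, estimated
# separately in a reference form (X) and a new form (Y).
b_anchor_x = np.array([-0.8, -0.2, 0.1, 0.6, 1.1])   # reference-form calibration
b_anchor_y = np.array([-0.5, 0.1, 0.4, 0.9, 1.4])    # new-form calibration

# Under the Rasch model the two scales differ only by a shift, so mean/mean
# linking reduces to the difference of the anchor-item difficulty means.
shift = b_anchor_x.mean() - b_anchor_y.mean()

# Apply the shift to all new-form item parameters and person estimates
# to express them on the reference-form scale.
b_y_all = np.array([-1.0, -0.5, 0.1, 0.4, 0.9, 1.4, 2.0])  # hypothetical form-Y items
b_y_linked = b_y_all + shift

theta_y = np.array([-0.3, 0.2, 1.1])   # hypothetical form-Y person estimates
theta_y_linked = theta_y + shift

print(f"linking constant B = {shift:.3f}")
```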

