Abstract

In item response theory (IRT) applications, assessing model-data fit is an essential step in calibration. Although no general agreement has been reached on the best methods or approaches for detecting misfit, a more important observation from the research literature is that studies rarely evaluate IRT misfit by focusing on its practical consequences. This study investigated the practical consequences of IRT model misfit for equating performance and for the classification of examinees into performance categories, using a simulation that mimics a typical large-scale statewide assessment program with mixed-format test data. The simulation varied three factors: the choice of IRT model, the amount of growth or change in examinees' abilities between two adjacent administration years, and the choice of IRT scaling method. Findings indicated that the extent to which model misfit had significant consequences varied with the choice of model and scaling method. Compared with separate calibration with linking through the mean/sigma (MS) and Stocking and Lord characteristic curve (SL) methods, the fixed common item parameter (FCIP) procedure was more sensitive to model misfit and more robust against various amounts of ability shift between two adjacent administrations, regardless of model fit. SL was generally the least sensitive to model misfit in recovering the equating conversion, and MS was the least robust against ability shifts in recovering the equating conversion when a substantial degree of misfit was present. The key messages of the study are that practical ways are available to study model fit and that model fit or misfit can have consequences that should be considered when choosing an IRT model. Beyond addressing the consequences of IRT model misfit, we hope the study helps researchers and practitioners find practical ways to study model fit, investigate the validity of particular IRT models for a specified purpose, ensure that IRT models are used successfully, and improve the application of IRT models to educational and psychological test data.
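
To make the scaling step concrete, the sketch below (not taken from the paper; all item values are hypothetical) shows how mean/sigma (MS) linking constants are typically computed from common-item difficulty estimates and used to place a new form's item parameters on the base scale.

```python
import numpy as np

def mean_sigma_constants(b_base, b_new):
    """Slope A and intercept B that map the new scale onto the base scale
    (theta_base = A * theta_new + B), from common-item difficulty estimates."""
    b_base = np.asarray(b_base, dtype=float)
    b_new = np.asarray(b_new, dtype=float)
    A = b_base.std(ddof=1) / b_new.std(ddof=1)
    B = b_base.mean() - A * b_new.mean()
    return A, B

def rescale_item_params(a_new, b_new, A, B):
    """Transform new-form discrimination (a) and difficulty (b) onto the base scale."""
    return np.asarray(a_new, dtype=float) / A, A * np.asarray(b_new, dtype=float) + B

# Hypothetical common-item difficulty estimates from two separate calibrations:
A, B = mean_sigma_constants(b_base=[-1.2, -0.3, 0.4, 1.1],
                            b_new=[-1.0, -0.1, 0.6, 1.4])
a_scaled, b_scaled = rescale_item_params([0.9, 1.3], [-0.2, 0.8], A, B)
```

The SL method chooses A and B by minimizing differences between test characteristic curves rather than matching difficulty moments, and FCIP fixes the common-item parameters at their base-scale values during calibration; the MS computation above is shown only because it is the simplest of the three to write out.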

Highlights

  • In item response theory (IRT), various methods and approaches have been suggested for detecting model misfit (Swaminathan et al., 2007). These measures of model fit typically summarize the discrepancy between observed values and the values expected under an IRT model, at either the item or the test level, through statistical tests of significance and/or graphical displays (a minimal illustrative sketch of such a comparison follows this list).

  • Does the amount of model misfit observed have a practical impact on the intended application? It is the consequences of the misfit that should be considered in deciding on the merits of a particular model in a particular situation.

  • When the 3PL/2PL/generalized partial credit model (GPC) combination was used as the calibration model, a negligible degree of misfit was expected.

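As a rough illustration of the kind of fit measure described in the first highlight, the sketch below (an assumption for illustration, not the authors' procedure) bins examinees by estimated ability and compares observed proportions correct with proportions expected under a 2PL item, in the spirit of chi-square item-fit statistics.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_fit_chi2(theta_hat, responses, a, b, n_bins=10):
    """Chi-square-type item-fit statistic: observed vs. model-expected
    proportion correct within ability-ordered groups of examinees."""
    order = np.argsort(theta_hat)
    theta_hat, responses = theta_hat[order], responses[order]
    chi2 = 0.0
    for idx in np.array_split(np.arange(theta_hat.size), n_bins):
        n = idx.size
        observed = responses[idx].mean()
        expected = p_2pl(theta_hat[idx], a, b).mean()
        chi2 += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2, n_bins - 2  # degrees-of-freedom conventions vary across statistics

# Hypothetical simulated data for one item:
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
resp = (rng.random(2000) < p_2pl(theta, a=1.2, b=0.3)).astype(int)
print(item_fit_chi2(theta, resp, a=1.2, b=0.3))
```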

Summary

Introduction

Although no general agreement has been reached on the best methods or approaches for detecting misfit, a more important observation from the research literature is that studies rarely evaluate IRT misfit by focusing on the consequences of using misfitting items and the item statistics and estimation errors associated with them. Despite the importance of assessing the consequences of IRT misfit for practical decision making, this research area has not received the attention it deserves in the IRT fit literature (Hambleton and Han, 2005). Meijer and Tendeiro (2015) analyzed two empirical data sets, examined the effect of removing misfitting items and misfitting item score patterns on the rank order of test takers by proficiency score, and found that the impact of removing them varied with the IRT model applied. Such studies offer good examples of assessing the consequences of IRT misfit by examining the agreement of decisions made when misfitting items (item misfit) or misfitting persons (person misfit) are included versus excluded.
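
To illustrate the kind of agreement check just described, the sketch below (hypothetical values; not Meijer and Tendeiro's analysis) compares examinees' rank order and performance-level classifications computed with and without flagged items, using a Spearman correlation and a simple agreement rate. The perturbed scores are a stand-in for proficiency estimates re-obtained after removing the misfitting items.

```python
import numpy as np
from scipy.stats import spearmanr

def classification_agreement(scores_all, scores_reduced, cut_points):
    """Proportion of examinees placed in the same performance category by both score sets."""
    levels_all = np.digitize(scores_all, cut_points)
    levels_reduced = np.digitize(scores_reduced, cut_points)
    return np.mean(levels_all == levels_reduced)

# Hypothetical proficiency estimates with and without the flagged items:
rng = np.random.default_rng(1)
theta_all = rng.normal(size=1000)
theta_reduced = theta_all + rng.normal(scale=0.15, size=1000)  # stand-in for re-estimated scores

rho, _ = spearmanr(theta_all, theta_reduced)
agree = classification_agreement(theta_all, theta_reduced, cut_points=[-0.5, 0.5, 1.5])
print(f"rank-order correlation: {rho:.3f}, classification agreement: {agree:.3f}")
```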
