Abstract

Latent regression models are used for score-reporting purposes in large-scale educational survey assessments such as the National Assessment of Educational Progress (NAEP) and the Trends in International Mathematics and Science Study (TIMSS). One component of these models is based on item response theory. While there exists some research on assessing the fit of item response theory models in the context of large-scale assessments, there is scope for further research on the topic. We suggest two types of residuals to assess the fit of item response theory models in the context of large-scale assessments. The Type I error rates and power of the residuals are computed from simulated data. The residuals are then computed using data from four NAEP assessments. Misfit was found in all data sets for both types of residuals, but the practical significance of the misfit was minimal.

Highlights

  • Large-scale educational survey assessments (LESAs) such as the United States’ National Assessment of Educational Progress (NAEP), the International Adult Literacy Study (IALS), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS) use item response theory (IRT) models for score-reporting purposes; Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association 2014) recommends obtaining evidence of model fit when an IRT model is used to make inferences from a data set

  • There is some improvement in fit when item-specific slope parameters are used: for all four National Assessment of Educational Progress (NAEP) data sets, the largest improvement in fit was seen between the unidimensional 1-parameter logistic (1PL)/partial credit model (PCM) and the 2-parameter logistic (2PL)/generalized partial credit model (GPCM)

  • The focus of this paper was on the assessment of misfit of the item response theory (IRT) model used in large-scale survey assessments such as NAEP, using data from four NAEP assessments


Summary

Introduction

Several large-scale educational survey assessments (LESAs), such as the United States’ National Assessment of Educational Progress (NAEP), the International Adult Literacy Study (IALS; Kirsch 2001), the Trends in International Mathematics and Science Study (TIMSS; Martin and Kelly 1996), and the Progress in International Reading Literacy Study (PIRLS; Mullis et al. 2003), involve the use of item response theory (IRT) models for score-reporting purposes (e.g., Beaton 1987; Mislevy et al. 1992; Von Davier and Sinharay 2014). Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association 2014) recommends obtaining evidence of model fit when an IRT model is used to make inferences from a data set. This paper suggests two types of residuals to assess the fit of IRT models used in LESAs; one of them can be used to assess item fit, and the other can be used to assess other aspects of fit of these models. For example, Y_ik could denote the scores of student i on the algebra questions presented to her on a mathematics test, and θ_ik could represent the student’s proficiency variable for algebra. Because NAEP assessments use matrix sampling (a design in which each student is presented only a subset of all available items), the algebra questions administered to student i form a subset J_ik of the set J_k of all available algebra questions on the test.
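To make the idea of a residual-based fit check concrete, the sketch below computes a simple standardized observed-minus-expected residual for a single dichotomous item under a 2PL model, with matrix sampling handled by treating unadministered items as missing. The item parameters, the proficiency values, and the residual formula itself are illustrative assumptions for this sketch; they are not the specific residuals proposed in the paper.

```python
# Minimal sketch (assumed setup, not the authors' exact residuals):
# a standardized observed-minus-expected item-fit residual for one
# dichotomous item under a 2PL model, with matrix sampling represented
# by NaN scores for students who were not administered the item.
import numpy as np

rng = np.random.default_rng(0)

n_students = 5000
theta = rng.normal(0.0, 1.0, n_students)   # assumed proficiency values (e.g., plausible values)
a, b = 1.2, 0.3                            # assumed 2PL slope and difficulty for the item

# Model-implied probability of a correct response under the 2PL model
p = 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Matrix sampling: only about half of the students are administered this item
administered = rng.random(n_students) < 0.5
y = np.where(administered, (rng.random(n_students) < p).astype(float), np.nan)

# Standardized residual: (observed proportion correct - expected proportion correct)
# among the students who actually took the item, divided by its binomial standard error
mask = ~np.isnan(y)
obs = y[mask].mean()
exp = p[mask].mean()
se = np.sqrt(np.sum(p[mask] * (1.0 - p[mask]))) / mask.sum()
z = (obs - exp) / se
print(f"observed = {obs:.3f}, expected = {exp:.3f}, standardized residual = {z:.2f}")
```

In this toy setting the data are generated from the same 2PL model that is used to compute the expectations, so the residual should be small; in practice, a large residual for an item would flag potential item misfit of the kind the paper's residuals are designed to detect.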

