Abstract

In educational measurement, various methods have been proposed to infer student proficiency from the ratings of multiple items (e.g., essays) by multiple raters. However, suitable models quickly become numerically demanding or even infeasible because separate latent variables are needed to account for local dependencies between the ratings of the same response. Therefore, in the present paper we derive a flexible approach based on Thurstone’s law of categorical judgment. The advantage of this approach is that it can be fit using weighted least squares estimation, which is computationally less demanding than most previous approaches as the number of latent variables increases. In addition, the new approach can be applied using existing latent variable modeling software. We illustrate the model on a real dataset from the Trends in International Mathematics and Science Study (TIMSS) comprising ratings of 10 items by 4 raters for 150 subjects. In addition, we compare the new model to existing models, including the facet model, the hierarchical rater model, and the hierarchical rater latent class model.

Highlights

  • In the field of educational measurement, inferences about students’ latent proficiencies underlying educational tests are commonly based on item response theory (IRT) modeling tools.

  • While the hierarchical rater model and the generalized rater model rely on Markov chain Monte Carlo (MCMC) estimation, the hierarchical rater latent class model, the rater bundle model, and the facet model rely mainly on marginal maximum likelihood (MML) estimation.

  • The estimation time of the hierarchical rater model depends on the number of iterations, but we think it does illustrate the difference between the two estimation approaches.

Summary

Introduction

In the field of educational measurement, inferences about students’ latent proficiencies underlying educational tests are commonly based on item response theory (IRT) modeling tools. The facet model extends the standard IRT modeling approach by adding a fixed effect for ‘rater severity’ to account for differences in rater characteristics. As will be discussed below, effort has been devoted to developing suitable models that take both the common rater and the common item effects into account. Although valuable, these models quickly become numerically demanding as the number of items and the number of subjects increase. In this paper, we propose a new approach that accounts for the different dependencies in the data in a similar way as the existing models, but that can be estimated in a numerically less demanding way.
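
For concreteness, the facet model mentioned above is often written in the following rating-scale form. This is a sketch in generic notation, not necessarily the parameterization used in this paper: the symbols \theta_n (proficiency of student n), \beta_i (difficulty of item i), \rho_j (severity of rater j), and \tau_k (threshold between categories k-1 and k) are assumptions introduced here for illustration.

```latex
% Adjacent-category logit for student n, item i, rater j, category k
% (generic many-facet notation; symbols are illustrative assumptions)
\log \frac{P(X_{nij} = k)}{P(X_{nij} = k-1)}
  = \theta_n - \beta_i - \rho_j - \tau_k
```

In this form, rater severity \rho_j enters as a fixed shift on the logit scale, so a more severe rater lowers the odds of the higher category for every student and item by the same amount.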

Current Models
Estimation Challenges
The Hierarchical Rater Model
The Hierarchical Rater Thresholds Model
Level 1
Application
Models
Estimation
Estimation time
Modeling results
Configuration of the ‘average laptop’ is
Discussion