Guessing During Testing Is a Person Attribute, Not an Instrument Parameter
The three-parameter logistic (3PL) model in item-response theory (IRT) has long been used to account for guessing in multiple-choice assessments through a fixed item-level parameter. However, this approach treats guessing as a property of the test item rather than the individual, potentially misrepresenting the cognitive processes underlying the examinee’s behavior. This study evaluates a novel alternative, the Two-Parameter Logistic Extension (2PLE) model, which re-conceptualizes guessing as a function of a person’s ability rather than as an item-specific constant. Using Monte Carlo simulation and empirical data from the PIRLS 2021 reading comprehension assessment, we compared the 3PL and 2PLE models on the recovery of latent ability, predictive fit (Leave-One-Out Information Criterion [LOOIC]), and theoretical alignment with test-taking behavior. The simulation results demonstrated that although both models performed similarly in terms of root-mean-squared error (RMSE) for ability estimates, the 2PLE model consistently achieved superior LOOIC values across conditions, particularly with longer tests and larger sample sizes. In an empirical analysis involving the reading achievement of 131 fourth-grade students from Saudi Arabia, model comparison again favored 2PLE, with a statistically significant LOOIC difference (ΔLOOIC = 0.482, z = 2.54). Importantly, person-level guessing estimates derived from the 2PLE model were significantly associated with established person-fit statistics (C*, U3), supporting their criterion validity. These findings suggest that the 2PLE model provides a more cognitively plausible and statistically robust representation of examinee behavior by embedding an ability-dependent guessing function.
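The contrast between the two parameterizations is easy to see side by side. The sketch below compares a standard 3PL item response function, whose pseudo-guessing floor c is constant for every examinee, against an illustrative ability-dependent guessing function. The logistic form of g(theta) here is an assumption made for illustration, not the 2PLE specification from the paper.

```python
# Contrast between an item-level constant guessing parameter (3PL) and a
# hypothetical person-level, ability-dependent guessing function. The exact
# functional form of the 2PLE guessing term is not given in the abstract;
# g(theta) below is an illustrative assumption, not the authors' specification.
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL: fixed pseudo-guessing c for every examinee on this item."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def p_ability_guess(theta, a, b, g0=0.35, k=1.0):
    """Illustrative ability-dependent guessing: low-ability examinees approach
    a chance floor g0; the floor fades as theta grows (assumed logistic form)."""
    g = g0 / (1.0 + np.exp(k * theta))   # person-level guessing, shrinks with theta
    return g + (1.0 - g) / (1.0 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])
print(p_3pl(theta, a=1.2, b=0.5, c=0.2))
print(p_ability_guess(theta, a=1.2, b=0.5))
```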
- Research Article
- 10.1080/03610918.2023.2245175
- Aug 3, 2023
- Communications in Statistics - Simulation and Computation
With developments in technology and information, paper-and-pencil tests are giving way to computerized adaptive tests (CATs). CAT is widely used in the field of health, mainly in psychiatry. Many item response theory models that use response time have been proposed in the literature; they focus on item difficulty and personal characteristics while ignoring multidimensional interactions, which may bias estimates of individual ability levels. The present simulation study was conducted to compare the performance of CAT applications of the effort-moderated item response theory (EM-IRT) model, which is based on response time, and the three-parameter logistic (3PL) model. While simulating CAT with the EM-IRT model and the 3PL model, the hybrid method was used for ability estimation and maximum Fisher information (MFI) was used for item selection. The CAT process proceeded until the standard error of estimation fell below 0.3 or 0.5, or until all items in the item bank were used. The number of individuals was fixed at 1000, while the number of items was varied across 50, 100, and 250. All six scenarios were replicated 1000 times. As the number of items increased and the standard-error stopping criterion tightened, both methods produced estimates consistent with true ability levels. CAT with the EM-IRT model estimated true ability levels slightly lower than CAT with the 3PL model. The EM-IRT model makes use of response time, which could give the physician additional information about the mental and cognitive condition of the patient. CAT may be a promising telemedicine method in the pandemic era.
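For readers unfamiliar with the mechanics this abstract assumes, the sketch below runs a minimal CAT loop under the plain 3PL model: MFI item selection at the current ability estimate, EAP updating on a quadrature grid, and stopping once the standard error drops below 0.3. The EM-IRT response-time moderation and the hybrid estimator from the study are not reproduced, and all item parameters are invented.

```python
# Minimal CAT loop: maximum Fisher information (MFI) item selection with an
# SE < 0.3 stopping rule, under the plain 3PL model. The EM-IRT moderation of
# responses by response time and the study's hybrid ability estimator are
# omitted; this sketch only illustrates the selection/stopping mechanics.
import numpy as np

rng = np.random.default_rng(1)
n_items = 100
a = rng.uniform(0.8, 2.0, n_items)      # invented discrimination parameters
b = rng.normal(0.0, 1.0, n_items)       # invented difficulties
c = rng.uniform(0.1, 0.25, n_items)     # invented pseudo-guessing floors
true_theta = 0.7

grid = np.linspace(-4, 4, 121)          # quadrature grid for EAP
prior = np.exp(-0.5 * grid**2)          # N(0,1) prior, unnormalized

def p3(theta, j):
    return c[j] + (1 - c[j]) / (1 + np.exp(-a[j] * (theta - b[j])))

def info(theta, j):                     # standard 3PL item information
    p = p3(theta, j)
    return a[j]**2 * ((1 - p) / p) * ((p - c[j]) / (1 - c[j]))**2

post = prior.copy()
used = []
theta_hat, se = 0.0, np.inf
while se > 0.3 and len(used) < n_items:
    avail = [j for j in range(n_items) if j not in used]
    j = max(avail, key=lambda j: info(theta_hat, j))  # MFI at current estimate
    x = rng.random() < p3(true_theta, j)              # simulate a response
    pj = p3(grid, j)
    post *= pj if x else (1 - pj)                     # Bayesian update
    post /= post.sum()
    theta_hat = (grid * post).sum()                   # EAP estimate
    se = np.sqrt(((grid - theta_hat)**2 * post).sum())
    used.append(j)

print(f"items used: {len(used)}, theta_hat = {theta_hat:.2f}, SE = {se:.2f}")
```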
- Research Article
- 10.1016/j.jkss.2019.04.001
- May 17, 2019
- Journal of the Korean Statistical Society
A comparison of Monte Carlo methods for computing marginal likelihoods of item response theory models
- Research Article
- 10.3390/e24060760
- May 27, 2022
- Entropy (Basel, Switzerland)
In educational large-scale assessment studies such as PISA, item response theory (IRT) models are used to summarize students’ performance on cognitive test items across countries. In this article, the impact of the choice of the IRT model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated. Eleven different IRT models are compared using information criteria. Moreover, model uncertainty is quantified by estimating model error, which can be compared with the sampling error associated with the sampling of students. The PISA 2009 dataset for the cognitive domains mathematics, reading, and science is used to illustrate the impact of the choice of IRT model. The three-parameter logistic IRT model with residual heterogeneity and a three-parameter IRT model with a quadratic effect of ability provided the best model fit. Furthermore, model uncertainty was relatively small compared with sampling error for country means in most cases but was substantial for country standard deviations and percentiles. Consequently, it can be argued that model error should be included in the statistical inference of educational large-scale assessment studies.
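The model-error idea can be stated compactly: the spread of a country statistic across candidate IRT models is an error component on top of the sampling error from drawing students. A schematic with invented numbers, not PISA values; combining the two components by a root sum of squares is one plausible reading of the comparison the abstract describes.

```python
# Schematic of the model-error idea: the spread of a country mean across
# candidate IRT models is compared with the sampling error from drawing
# students. All numbers are invented for illustration only.
import numpy as np

country_means = np.array([501.2, 499.8, 502.7, 500.4])  # same country, 4 models (hypothetical)
model_error = np.std(country_means, ddof=1)              # between-model spread

n_students, score_sd = 5000, 95.0                        # hypothetical sample
sampling_error = score_sd / np.sqrt(n_students)

# One plausible way to combine the components (assumption, not the paper's formula)
total_error = np.sqrt(model_error**2 + sampling_error**2)
print(model_error, sampling_error, total_error)
```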
- Front Matter
- 10.1016/s1551-7144(09)00212-2
- Jan 1, 2010
- Contemporary Clinical Trials
Classical and modern measurement theories, patient reports, and clinical outcomes
- Dissertation
- 10.17077/etd.005181
- Apr 9, 2020
Nowadays it is not uncommon that tests, especially high-stakes assessments, are administered with time constraints. When a test is constructed to assess examinees’ abilities in academic knowledge, but the imposed time limits affect examinees’ test performance, speededness effects become a concern. Under such circumstances, inaccurate psychometric results and inferences might be drawn if unidimensional item response theory (IRT) models are applied in testing practice. Speededness detection methods have been proposed to identify speeded responses and examinees. Thus, the purpose of the study was to comprehensively investigate how various detection methods combined with various calibration treatments compared in reducing speededness effects under the 2PL and 3PL IRT models with: (1) simulated test data under various speededness conditions, and (2) real test data. Both simulated and real data analyses were conducted in this study. Two simulation studies were conducted. For the first simulation study, two main factors were investigated: (1) degree of speededness (three levels: none, 10%, and 25%), and (2) IRT calibration model (two models: 2PL and 3PL). The performance of various combinations of detection methods and calibration treatments was evaluated by assessing Pearson correlations, item parameter recovery, and model-data fit statistics. Data generated in the second simulation study were based on the estimated person and item parameter values obtained from IRT model calibration of the real data used in this study. Thus, the second simulation study served as a link between the pure simulation study and the real data study, because this generation process enabled the simulated dataset to carry some characteristics of the real data while the true parameter values were known. The real data came from a large pool of high-stakes standardized assessment items. In the current study, it was found that treating the identified speeded responses as “not presented” always led to more accurate psychometric results compared with the other calibration treatments across various speededness levels under both the 2PL and 3PL IRT models. When the speededness level was large, “removing speeded examinees” usually yielded results comparable to the “not-presented” treatment across different detection methods, and it is a feasible and easily implemented option in practice. In addition, it was found that detection methods using the item response time (RT) distribution as a speededness indicator (i.e., the INSPECT and VITP methods in the current study) generally performed better than the other detection methods in dealing with speededness effects. Moreover, it was found that the inclusion of the c-parameter could deal well with the rapid-guessing strategy. Thus, when the speededness level was not large and was mainly caused by rapid-guessing behavior, “no treatment” under the 3PL IRT model yielded accurate psychometric results. The findings of the current study provide several feasible options for practitioners when speededness is a concern and unidimensional IRT models are used in the calibration or scoring process. It is hoped that this study will inspire researchers and practitioners to develop new detection methods, or new ways of dealing with speededness effects, under unidimensional IRT models.
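The “not-presented” treatment the dissertation recommends amounts to recoding flagged responses as missing before calibration. The sketch below uses a simple per-item response-time quantile as the flagging rule; that threshold is a placeholder assumption standing in for the more elaborate INSPECT and VITP methods, and all data are simulated.

```python
# "Not-presented" treatment sketch: flag suspected speeded responses by a
# simple response-time threshold and recode them as missing before IRT
# calibration. A fixed per-item RT quantile stands in for the study's
# INSPECT/VITP detection methods (an assumption for illustration).
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 200, 40
resp = rng.integers(0, 2, (n_persons, n_items)).astype(float)  # simulated 0/1 responses
log_rt = rng.normal(3.0, 0.6, (n_persons, n_items))            # simulated log response times

threshold = np.quantile(log_rt, 0.05, axis=0)   # per-item rapid-response cutoff (assumed)
speeded = log_rt < threshold                    # True where a response looks rapid

resp[speeded] = np.nan                          # treat flagged responses as not presented
print(f"{speeded.mean():.1%} of responses recoded as missing")
```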
- Research Article
- 10.3758/bf03214461
- Jul 1, 1985
- Behavior Research Methods, Instruments, & Computers
Bock and Aitkin (1981) provided a powerful estimation technique for use with item response theory. Marginal maximum likelihood estimation (MMLE) treats the examinee ability parameter as a random nuisance parameter that is eliminated from the estimation process by specifying a form for the ability distribution and integrating over that distribution. Thus, item-parameter estimates are obtained by maximum likelihood estimation (MLE) in the marginal distribution. In this way, problems such as the inconsistent parameter estimates that arise in joint MLE are avoided. The MARGIE program is designed to perform MMLE of the item parameters of the one-parameter logistic (1PL), two-parameter logistic (2PL), and three-parameter logistic (3PL) item response theory (IRT) models. In addition, ability estimates are produced using either MLE or Bayesian expected a posteriori estimation (EAPE). Also, a chi-square test of the goodness of fit of the IRT model to the data is performed, using the procedure described by Yen (1981). The Models. The program estimates the parameters of the 1PL, 2PL, and 3PL IRT models. The 1PL model gives the probability of a correct response to item $j$ as a logistic function of the difference between ability $\theta$ and item difficulty $b_j$: $P_j(\theta) = \{1 + \exp[-(\theta - b_j)]\}^{-1}$.
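The quantity MMLE maximizes is the marginal probability of each response pattern, with ability integrated out against the assumed distribution. A minimal sketch for one 1PL response pattern under an N(0,1) ability distribution, using Gauss-Hermite quadrature; the item difficulties are invented and MARGIE's EM machinery is not reproduced.

```python
# MMLE core step: the marginal probability of a response pattern integrates
# the conditional (1PL) likelihood over an assumed N(0,1) ability
# distribution, here via Gauss-Hermite quadrature. This sketches the quantity
# being maximized, not the full EM algorithm of Bock and Aitkin (1981).
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

b = np.array([-1.0, 0.0, 1.5])                   # invented item difficulties
x = np.array([1, 1, 0])                          # one observed response pattern

nodes, weights = hermegauss(21)                  # probabilists' Hermite nodes/weights
weights = weights / weights.sum()                # normalize to N(0,1) probability weights

p = 1.0 / (1.0 + np.exp(-(nodes[:, None] - b)))  # 1PL P(x=1 | theta) at each node
lik = np.prod(np.where(x == 1, p, 1 - p), axis=1)
marginal = np.sum(weights * lik)                 # P(x) with theta integrated out
print(marginal)
```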
- Book Chapter
- 10.1007/978-3-319-07503-7_3
- Jan 1, 2015
Unidimensional item response theory (IRT) models assume that a single model applies to all people in the population. Mixture IRT models can be useful when subpopulations are suspected. The usual mixture IRT model is typically estimated assuming normally distributed latent ability. Research on normal finite mixture models suggests that latent classes potentially can be extracted even in the absence of population heterogeneity if the distribution of the data is nonnormal. Empirical evidence suggests, in fact, that test data may not always be normal. In this study, we examined the sensitivity of mixture IRT models to latent nonnormality. Single-class IRT data sets were generated using different ability distributions and then analyzed with mixture IRT models to determine the impact of these distributions on the extraction of latent classes. Preliminary results suggest that estimation of mixed Rasch models resulted in the extraction of spurious latent classes when distributions were bimodal or uniform. Mixture 2PL and mixture 3PL IRT models were found to be more robust to nonnormal latent ability distributions. Two popular information criterion indices, Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC), were used to inform model selection. For most conditions, the BIC performed better than the AIC for selecting the correct model.
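The AIC/BIC comparison the chapter relies on penalizes the extra parameters a second latent class brings. A toy illustration with invented log-likelihoods and parameter counts, showing how BIC's log(n) penalty can retain the one-class model where AIC does not:

```python
# AIC/BIC model-selection sketch for one- vs. two-class mixture IRT fits.
# The log-likelihoods and parameter counts are invented for illustration.
import numpy as np

n = 1000                                   # examinees
fits = {                                   # hypothetical (logL, n_parameters)
    "1-class Rasch": (-12450.3, 21),
    "2-class Rasch": (-12411.8, 43),
}
for name, (logL, k) in fits.items():
    aic = -2 * logL + 2 * k                # AIC: constant penalty per parameter
    bic = -2 * logL + k * np.log(n)        # BIC: penalty grows with log(n)
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")
# Here AIC favors the 2-class fit while BIC retains the 1-class model,
# mirroring the abstract's finding that BIC better avoided spurious classes.
```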
- Conference Article
- 10.1063/1.4801273
- Jan 1, 2013
Several alternative dichotomous Item Response Theory (IRT) models have been introduced to account for the guessing effect in multiple-choice assessment. The guessing effect in these models has been considered to be item-related. In the most classic case, pseudo-guessing in the three-parameter logistic IRT model is modeled to be the same for all subjects but may vary across items. This is not realistic, because subjects can guess worse or better than the pseudo-guessing parameter implies. An extension of the three-parameter logistic IRT model improves the situation by incorporating ability into guessing. However, it does not model a non-monotone function. This paper proposes to study guessing from a subject-related aspect, namely guessing as test-taking behavior. A mixture Rasch model is employed to detect latent groups. A hybrid of the mixture Rasch and three-parameter logistic IRT models is proposed to model behavior-based guessing from the subjects' ways of responding to the items. Guessing subjects are assumed to simply choose a response at random. An information criterion is proposed to identify the behavior-based guessing group. Results show that the proposed model selection criterion provides a promising method to identify the guessing group modeled by the hybrid model.
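The random-responding assumption makes the guessing class easy to write down: its likelihood for any response pattern depends only on the number of options per item. A toy sketch of the resulting two-class posterior, with invented parameters and the class proportion treated as known rather than estimated as in the hybrid model:

```python
# Mixture sketch of behavior-based guessing: a latent "guessing" class answers
# every item at random (probability 1/m for m options), while the other class
# follows a Rasch model. All parameter values are invented, and the class
# proportion is fixed rather than estimated as in the paper's hybrid model.
import numpy as np

m = 4                                            # answer options per item
b = np.array([-1.2, -0.4, 0.3, 1.0, 1.8])        # invented Rasch difficulties
x = np.array([1, 0, 1, 0, 1])                    # one response pattern
theta = 0.5                                      # non-guesser ability (assumed known)

p_rasch = 1 / (1 + np.exp(-(theta - b)))
lik_rasch = np.prod(np.where(x == 1, p_rasch, 1 - p_rasch))
lik_guess = np.prod(np.where(x == 1, 1 / m, 1 - 1 / m))  # purely random responding

pi = 0.1                                         # assumed prior share of guessers
post_guess = pi * lik_guess / (pi * lik_guess + (1 - pi) * lik_rasch)
print(f"posterior probability of the guessing class: {post_guess:.3f}")
```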
- Research Article
- 10.3390/jintelligence8010005
- Feb 4, 2020
- Journal of Intelligence
Raven’s Standard Progressive Matrices (SPM) test and related matrix-based tests are widely applied measures of cognitive ability. Using Bayesian Item Response Theory (IRT) models, I reanalyzed data of an SPM short form proposed by Myszkowski and Storme (2018) and, at the same time, illustrate the application of these models. Results indicate that a three-parameter logistic (3PL) model is sufficient to describe participants’ dichotomous responses (correct vs. incorrect), while persons’ ability parameters are quite robust across IRT models of varying complexity. These conclusions are in line with the original results of Myszkowski and Storme (2018). Using Bayesian as opposed to frequentist IRT models offered advantages in the estimation of more complex (i.e., 3–4PL) IRT models and provided more sensible and robust uncertainty estimates.
- Research Article
- 10.1177/01466216221108995
- Jun 17, 2022
- Applied psychological measurement
Applying item response theory (IRT) true score equating to multidimensional IRT models is not straightforward due to the one-to-many relationship between a true score and latent variables. Under the common-item nonequivalent groups design, the purpose of the current study was to introduce two IRT true score equating procedures that adopted different dimension reduction strategies for the bifactor model. The first procedure, which was referred to as the integration procedure, linked the latent variable scales for the bifactor model and integrated out the specific factors from the item response function of the bifactor model. Then, IRT true score equating was applied to the marginalized bifactor model. The second procedure, which was referred to as the PIRT-based procedure, projected the specific dimensions onto the general dimension to obtain a locally dependent unidimensional IRT (UIRT) model and linked the scales of the UIRT model, followed by the application of IRT true score equating to the locally dependent UIRT model. Equating results obtained with the two equating procedures along with those obtained with the unidimensional three-parameter logistic (3PL) model were compared using both simulated and real data. In general, the integration and PIRT-based procedures provided equating results that were not practically different. Furthermore, the equating results produced by the two bifactor-based procedures became more accurate than the results returned by the 3PL model as tests became more multidimensional.
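The integration procedure's core step is marginalizing the specific factor out of the bifactor item response function, leaving a curve in the general factor alone to which true score equating can be applied. A minimal sketch with invented bifactor 2PL parameters and a standard-normal specific factor:

```python
# "Integration procedure" sketch: marginalize the specific factor out of a
# bifactor 2PL item response function via Gauss-Hermite quadrature, leaving a
# marginal IRF in the general factor alone. Parameter values are invented;
# the scale linking and equating steps from the paper are not reproduced.
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

a_g, a_s, d = 1.3, 0.8, -0.2                 # general/specific slopes, intercept (invented)
nodes, w = hermegauss(21)
w = w / w.sum()                              # N(0,1) quadrature weights

def marginal_irf(theta_g):
    """P(x=1 | theta_g) with the specific factor integrated out."""
    p = 1 / (1 + np.exp(-(a_g * theta_g + a_s * nodes - d)))
    return np.sum(w * p)

for t in (-1.0, 0.0, 1.0):
    print(t, round(marginal_irf(t), 3))
```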
- Research Article
- 10.21449/ijate.581314
- Jan 5, 2020
- International Journal of Assessment Tools in Education
Item Response Theory (IRT) models traditionally assume a normal distribution for ability. Although normality is often a reasonable assumption for ability, it is rarely met for observed scores in educational and psychological measurement. Assumptions regarding ability distribution were previously shown to have an effect on IRT parameter estimation. In this study, the normal and uniform distribution prior assumptions for ability were compared for IRT parameter estimation when the actual distribution was either normal or uniform. A simulation study that included a short test with a small sample size and a long test with a large sample size was conducted for this purpose. The results suggested using a uniform distribution prior for ability to achieve more accurate estimates of the ability parameter in the 2PL and 3PL models when the true distribution of ability is not known. For the Rasch model, an explicit pattern that could be used to obtain more accurate item parameter estimates was not found.
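The prior's influence is easiest to see in a single EAP calculation: the same likelihood scored against a normal versus a uniform ability prior shifts the estimate. A sketch with invented 2PL item parameters:

```python
# Prior-sensitivity sketch: one response pattern, one 2PL likelihood, scored
# against a normal and a uniform ability prior on the same grid. The item
# parameters are invented for illustration.
import numpy as np

grid = np.linspace(-4, 4, 161)
a = np.array([1.0, 1.4, 0.9])                  # invented discriminations
b = np.array([-0.5, 0.2, 1.1])                 # invented difficulties
x = np.array([1, 1, 0])                        # one response pattern

p = 1 / (1 + np.exp(-a * (grid[:, None] - b)))
lik = np.prod(np.where(x == 1, p, 1 - p), axis=1)

normal = np.exp(-0.5 * grid**2)                # N(0,1) prior (unnormalized)
uniform = np.ones_like(grid)                   # flat prior over the grid
for name, prior in (("normal", normal), ("uniform", uniform)):
    post = lik * prior
    post /= post.sum()
    print(name, round(float((grid * post).sum()), 3))   # EAP under each prior
```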
- Research Article
- 10.3102/1076998620945199
- Aug 13, 2020
- Journal of Educational and Behavioral Statistics
The estimation of high-dimensional latent regression item response theory (IRT) models is difficult because of the need to approximate integrals in the likelihood function. Proposed solutions in the literature include using stochastic approximations, adaptive quadrature, and Laplace approximations. We propose using a second-order Laplace approximation of the likelihood to estimate IRT latent regression models with categorical observed variables and fixed covariates, where all parameters are estimated simultaneously. The method applies when the IRT model has a simple structure, meaning that each observed variable loads on only one latent variable. Through simulations using a latent regression model with binary and ordinal observed variables, we show that the proposed method is a substantial improvement over the first-order Laplace approximation with respect to bias. In addition, the approach is equally or more precise than alternative methods for the estimation of multidimensional IRT models when the number of items per dimension is moderately high. At the same time, the method is highly computationally efficient in the high-dimensional settings investigated. The results imply that estimation of simple-structure IRT models with very high dimensions is feasible in practice and that the direct estimation of high-dimensional latent regression IRT models is tractable even with large sample sizes and large numbers of items.
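The machinery being improved on is the first-order Laplace approximation, which replaces each intractable integral with a Gaussian expansion around the mode of the integrand. A one-dimensional sketch for a single Rasch response pattern, checked against quadrature; the paper's second-order correction and high-dimensional latent regression setting are not reproduced.

```python
# First-order Laplace approximation of a 1-D marginal likelihood integral
# (one person, Rasch items, N(0,1) ability), compared against numerical
# quadrature. Item difficulties are invented; this shows only the mechanics
# that the paper's second-order method refines.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

b = np.array([-1.0, 0.0, 1.0])                   # invented Rasch difficulties
x = np.array([1, 0, 1])                          # one response pattern

def log_joint(t):                                # log f(x|theta) + log phi(theta)
    p = 1 / (1 + np.exp(-(t - b)))
    ll = np.sum(np.where(x == 1, np.log(p), np.log(1 - p)))
    return ll - 0.5 * t**2 - 0.5 * np.log(2 * np.pi)

t_hat = minimize_scalar(lambda t: -log_joint(t)).x       # posterior mode
h = 1e-4                                                 # numerical 2nd derivative
curv = (log_joint(t_hat + h) - 2 * log_joint(t_hat) + log_joint(t_hat - h)) / h**2

# Laplace: integral of exp(g) ~= exp(g(t_hat)) * sqrt(2*pi / -g''(t_hat))
laplace = np.exp(log_joint(t_hat)) * np.sqrt(2 * np.pi / -curv)
truth, _ = quad(lambda t: np.exp(log_joint(t)), -8, 8)
print(laplace, truth)
```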
- Research Article
- 10.1177/0146621615605080
- Sep 22, 2015
- Applied Psychological Measurement
Unidimensional item response theory (IRT) models assume a single homogeneous population. Mixture IRT (MixIRT) models can be useful when subpopulations are suspected. The usual MixIRT model is typically estimated assuming a normally distributed latent ability. Research on normal finite mixture models suggests that latent classes potentially can be extracted, even in the absence of population heterogeneity, if the distribution of the data is non-normal. In this study, the authors examined the sensitivity of MixIRT models to latent non-normality. Single-class IRT data sets were generated using different ability distributions and then analyzed with MixIRT models to determine the impact of these distributions on the extraction of latent classes. Results suggest that estimation of mixed Rasch models resulted in the extraction of spurious latent classes when distributions were bimodal or uniform. Mixture two-parameter logistic (2PL) and mixture three-parameter logistic (3PL) IRT models were found to be more robust to latent non-normality.
- Research Article
- 10.1080/10705511.2011.581993
- Jun 30, 2011
- Structural Equation Modeling: A Multidisciplinary Journal
Linear factor analysis (FA) models can be reliably tested using test statistics based on residual covariances. We show that the same statistics can be used to reliably test the fit of item response theory (IRT) models for ordinal data (under some conditions). Hence, the fit of an FA model and of an IRT model to the same data set can now be compared. Our experience suggests that, when applied to binary data sets, IRT and FA models yield similar fits. However, when the data are polytomous ordinal, IRT models yield a better fit because they involve a higher number of parameters. But when fit is assessed using the root mean square error of approximation (RMSEA), similar fits are obtained again. We explain why. These test statistics have little power to distinguish between FA and IRT models; they are unable to detect that linear FA is misspecified when applied to ordinal data generated under an IRT model.
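The df-based penalty is what reconciles the fits: RMSEA divides the misfit by the degrees of freedom, offsetting the IRT models' larger parameter counts. One common sample definition (a standard formula, not quoted from the paper):

```latex
% One common sample definition of the RMSEA; \chi^2 is the model test
% statistic, df its degrees of freedom, N the sample size. Larger models
% spend df, so comparable chi-square/df ratios yield similar RMSEA values.
\mathrm{RMSEA} = \sqrt{\max\!\left(\frac{\chi^2 - df}{df\,(N-1)},\; 0\right)}
```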
- Research Article
- 10.1177/00131644211045351
- Sep 13, 2021
- Educational and Psychological Measurement
Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response times have been suggested, and item response theory (IRT) models for response engagement have been proposed. We outline that response time-based procedures for classifying response engagement and IRT models for response engagement are based on common ideas, and we propose the distinction between independent and dependent latent class IRT models. In all IRT models considered, response engagement is represented by an item-level latent class variable, but the models assume that response times either reflect or predict engagement. We summarize existing IRT models that belong to each group and extend them to increase their flexibility. Furthermore, we propose a flexible multilevel mixture IRT framework in which all IRT models can be estimated by means of marginal maximum likelihood. The framework is based on the widespread Mplus software, thereby making the procedure accessible to a broad audience. The procedures are illustrated on the basis of publicly available large-scale data. Our results show that the different IRT models for response engagement provided slightly different adjustments of item parameters and of individuals’ proficiency estimates relative to a conventional IRT model.