Evaluation of Item Fit With Output From the EM Algorithm: RMSD Index Based on Posterior Expectations.

Abstract

In item response theory modeling, item fit analysis using posterior expectations, also known as pseudocounts, has many advantages. These quantities are readily obtained from the E-step output of the Bock-Aitkin expectation-maximization (EM) algorithm and remain a valid basis for evaluating model fit even when missing data are present. This paper aimed to improve the interpretability of the root mean squared deviation (RMSD) index based on posterior expectations. In Study 1, we assessed its performance using two approaches. First, we employed the poor person's posterior predictive model checking (PP-PPMC) to compute significance levels for the RMSD values. The resulting Type I error was generally controlled below the nominal level, but power declined noticeably with smaller sample sizes and shorter test lengths. Second, we used receiver operating characteristic (ROC) curve analysis to empirically determine the reference values (cutoff thresholds) that achieve an optimal balance between false-positive and true-positive rates. Importantly, we identified optimal reference values for each combination of sample size and test length in the simulation conditions. The cutoff threshold approach outperformed the PP-PPMC approach, with gains in true-positive rates exceeding the losses from inflated false-positive rates. In Study 2, we extended the cutoff threshold approach to conditions with larger sample sizes and longer test lengths. Moreover, we evaluated the performance of the optimized cutoff thresholds under varying levels of data missingness. Finally, we employed response surface analysis to develop a prediction model that generalizes how the reference values vary with sample size and test length. Overall, this study demonstrates the application of the PP-PPMC for item fit diagnostics and implements a practical frequentist approach to empirically derive reference values. Using our prediction model, practitioners can compute RMSD reference values tailored to their dataset's sample size and test length.
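
Since the pseudocount-based RMSD is the paper's subject, a minimal sketch may help fix ideas. It assumes the definition commonly used in large-scale assessments: at each quadrature node, the pseudo-observed proportion correct (the E-step expected count of correct responses divided by the expected examinee count) is compared against the model-implied item response function, and the squared deviations are averaged with posterior-expectation weights. The 2PL parameterization, function names, and example inputs are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Model-implied probability of a correct response under the 2PL."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def rmsd_item(r_k, n_k, theta_nodes, a, b):
    """RMSD for one item from E-step pseudocounts.

    r_k : expected count of correct responses at each quadrature node
    n_k : expected count of examinees at each quadrature node
    """
    p_obs = r_k / n_k                      # pseudo-observed proportions
    p_model = irf_2pl(theta_nodes, a, b)   # model-implied item response curve
    w = n_k / n_k.sum()                    # posterior-expectation weights
    return np.sqrt(np.sum(w * (p_obs - p_model) ** 2))

# Hypothetical inputs: 21 quadrature nodes with N(0, 1)-shaped pseudocounts
theta = np.linspace(-4, 4, 21)
n_k = 1000 * np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)
r_k = n_k * irf_2pl(theta, a=1.2, b=0.3)          # a perfectly fitting item
print(rmsd_item(r_k, n_k, theta, a=1.2, b=0.3))   # ~0 for perfect fit
```

In practice, r_k and n_k are by-products of the final E-step of the Bock-Aitkin EM run, which is what makes the index cheap to compute even when data are missing.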

Similar Papers
  • Research Article
  • Cited by 3
  • 10.21031/epod.02072
Comparison of Item Response Theory Parameters and Model Fits: A Monte Carlo Study [Madde Tepki Kuramına ait Parametrelerin ve Model Uyumlarının Karşılaştırılması: Bir Monte Carlo Çalışması]
  • Dec 26, 2015
  • Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi
  • Hakan Koğar

The purpose of this study is to identify and compare NIRT, PIRT, and MIRT models across different sample sizes, test lengths, and correlations between dimensions in two-dimensional simple structures. Data sets were simulated under varying conditions: sample size (100, 500, 1000, and 5000), test length (5, 15, and 25), and correlation between dimensions (0.00, 0.25, and 0.50). For each cell of the Monte Carlo design, findings were obtained from 20 replications. Standard errors and significance values were calculated for the item parameters and item-level model-data fit. The analyses showed that model-data fit for the test improved as sample size and test length increased. Tests consisting of fewer items fit the MIRT models better. In all simulation designs, item-level model-data fit was estimated with quite low error under NIRT. Analysis of the chi-square, infit, and outfit values obtained for PIRT revealed that all three coefficients indicated better model fit as sample size and test length increased. In NIRT, the standard errors of the Hi and p parameters tended to decrease as sample size and test length increased. In PIRT, the standard errors of the a parameters tended to decrease as sample size and test length increased.

  • Research Article
  • Cited by 35
  • 10.12738/estp.2017.1.0270
The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
  • Jan 1, 2017
  • Educational Sciences: Theory & Practice
  • Alper Şahin + 1 more

This study investigates the effects of sample size and test length on item-parameter estimation in test development utilizing three unidimensional dichotomous models of item response theory (IRT). For this purpose, a real language test comprising 50 items was administered to 6,288 students. Data from this test were used to obtain data sets of three test lengths (10, 20, and 30 items) and nine sample sizes (150, 250, 350, 500, 750, 1,000, 2,000, 3,000, and 5,000 examinees). These data sets were then used to create research conditions in which test length, sample size, and IRT model were manipulated to investigate the accuracy of item parameter estimation. The results suggest that rather than sample size or test length alone, the combination of the two is what matters, and samples of 150, 250, 350, 500, and 750 examinees can be used to estimate item parameters accurately in the three unidimensional dichotomous IRT models, depending on the test length and model employed.

  • Research Article
  • 10.52096/usbd.9.41.10
Examining The Effect of Sample Size and Test Length on Parameter Estimates in The Multidimensional Item Response Theory Model Using Different Algorithms
  • Jul 9, 2025
  • International Journal of Social Sciences
  • Ömer Doğan

Multidimensional Item Response Theory (MIRT) offers a robust framework for modeling complex latent traits, addressing the limitations of unidimensional models in educational and psychological assessments. This study aims to examine the estimation performance of three widely used algorithms, Expectation-Maximization (EM), Stochastic EM (SEM), and Metropolis-Hastings Robbins-Monro (MH-RM), under varying test lengths (10, 20, and 40 items) and sample sizes (N = 1000 and 5000). Employing a Monte Carlo simulation design, datasets were generated in R to reflect multidimensional structures. Item and ability parameters were estimated, and estimation accuracy was evaluated using root mean square error (RMSE) and bias statistics. Findings indicate that with a sample size of 1000, MH-RM performs most effectively for discrimination parameters (a) in longer tests, while SEM is preferable for shorter tests. EM yielded the least biased estimates for the difficulty (d) and guessing (c) parameters. At N = 5000, both EM and MH-RM demonstrated lower RMSE values, while SEM showed higher error rates. Overall, increasing sample size and test length improved estimation accuracy across all methods. The results corroborate previous findings and highlight the superior performance of MH-RM in high-dimensional estimation, particularly in large-scale testing scenarios. These outcomes provide practical guidance for researchers in selecting suitable estimation techniques under different test conditions.
Keywords: Multidimensional Item Response Theory, Expectation Maximization, Metropolis-Hastings Robbins-Monro, Stochastic Expectation Maximization.

  • Research Article
  • 10.35516/edu.v51i3.6755
The Effectiveness of Mantel Haenszel Log Odds Ratio Method in Detecting Differential Item Functioning Across Different Sample Sizes and Test Lengths Using Real Data Analysis
  • Sep 15, 2024
  • Dirasat: Educational Sciences
  • Reem Mohammad Elyan + 1 more

Objectives: This study aims to determine the effectiveness of the Mantel-Haenszel Log Odds Ratio method in detecting Differential Item Functioning (DIF) across gender, while considering variations in sample size and test length. Utilizing real data, the study draws from a sample of tenth-grade students in Jordan who participated in the 2018 PISA International Mathematics Test. Methods: The study employs an experimental methodology, utilizing three levels of sample size and test length: (342, 200, and 100) and (30, 20, and 10), respectively. Nine iterations of the DDFS program were conducted to collect the results, representing the nine scenarios resulting from crossing the sample-size and test-length levels. Results: The study indicates that variations in sample size and test length significantly affect the Mantel-Haenszel (MH) method. Specifically, it observes an improvement in the MH method's ability to detect DIF items with larger sample sizes when test length is held constant. Conversely, the method's efficacy declines with longer test lengths when the sample size is fixed at a given level. Conclusion: The study recommends using a large sample size and a short test length for effective detection of DIF items using the MH method.
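
As background for the method this abstract names, here is a small sketch of the Mantel-Haenszel common log odds ratio, pooled over total-score strata. The counts, the helper name, and the strata are hypothetical; the ETS delta rescaling in the comment is the conventional one.

```python
import math

def mh_log_odds_ratio(tables):
    """Mantel-Haenszel common log odds ratio pooled over score strata.

    tables: (A, B, C, D) counts per stratum, where A/B are the reference
    group's correct/incorrect counts and C/D the focal group's.
    """
    num = sum(A * D / (A + B + C + D) for A, B, C, D in tables)
    den = sum(B * C / (A + B + C + D) for A, B, C, D in tables)
    return math.log(num / den)   # ~0 means no DIF against either group

# Hypothetical 2x2 tables for one item, stratified by total score
strata = [(40, 10, 30, 20), (35, 15, 25, 25), (20, 30, 10, 40)]
log_or = mh_log_odds_ratio(strata)
print(log_or, -2.35 * log_or)    # ETS delta scale: -2.35 * ln(alpha_MH)
```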

  • Research Article
  • Cited by 2
  • 10.1177/0146621618779985
A Posterior Predictive Model Checking Method Assuming Posterior Normality for Item Response Theory.
  • Jun 29, 2018
  • Applied Psychological Measurement
  • Megan Kuhfeld

This study investigated the violation of local independence assumptions within unidimensional item response theory (IRT) models. Bayesian posterior predictive model checking (PPMC) methods are increasingly being used to investigate multidimensionality in IRT models. The current work proposes a PPMC method for evaluating local dependence in IRT models that are estimated using full-information maximum likelihood. The proposed approach, termed "PPMC assuming posterior normality" (PPMC-N), provides a straightforward method for accounting for parameter uncertainty in model fit assessment. A simulation study demonstrated the comparability of the PPMC-N and Bayesian PPMC approaches in detecting local dependence in dichotomous IRT models.
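
The mechanics can be illustrated with a toy example. The sketch below is a guess at the general shape of PPMC-N rather than the paper's implementation: item parameters are drawn from a normal approximation centered at the ML estimates, replicated datasets are generated, and a local-dependence discrepancy (here, the sample log odds ratio for an item pair) is compared against its realized value to yield a posterior predictive p-value. All estimates, standard errors, and sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

def irf(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

def log_odds_ratio(x, y):
    """Discrepancy for an item pair: sample log odds ratio, a common
    PPMC measure for local dependence (0.5 = continuity correction)."""
    n11 = np.sum((x == 1) & (y == 1)) + 0.5
    n00 = np.sum((x == 0) & (y == 0)) + 0.5
    n10 = np.sum((x == 1) & (y == 0)) + 0.5
    n01 = np.sum((x == 0) & (y == 1)) + 0.5
    return np.log(n11 * n00 / (n10 * n01))

# Hypothetical ML estimates and standard errors for a two-item pair
a_hat, b_hat, se = np.array([1.1, 0.9]), np.array([-0.2, 0.4]), 0.07
N, R = 1000, 200

theta = rng.standard_normal(N)
observed = (rng.random((N, 2)) < irf(theta[:, None], a_hat, b_hat)).astype(int)
realized = log_odds_ratio(observed[:, 0], observed[:, 1])

rep = np.empty(R)
for r in range(R):
    # PPMC-N step: draw parameters from a normal approximation to the posterior
    a_r, b_r = rng.normal(a_hat, se), rng.normal(b_hat, se)
    th = rng.standard_normal(N)
    y = (rng.random((N, 2)) < irf(th[:, None], a_r, b_r)).astype(int)
    rep[r] = log_odds_ratio(y[:, 0], y[:, 1])

print(np.mean(rep >= realized))   # posterior predictive p-value
```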

  • Research Article
  • 10.24256/jpmipa.v9i1.2384
Sample Size and Test Length for Item Parameter Estimate and Exam Parameter Estimate
  • Dec 26, 2021
  • Al-Khwarizmi : Jurnal Pendidikan Matematika dan Ilmu Pengetahuan Alam
  • Riswan Riswan

The Item Response Theory (IRT) model contains one or more parameters that are unknown and must be estimated. This paper aims (1) to determine the effect of sample size (N) on the stability of item parameter estimates, (2) to determine the effect of test length (n) on the stability of examinee parameter estimates, (3) to determine the effect of the IRT model on the stability of item and examinee parameter estimates, (4) to determine the joint effect of sample size and test length on the stability of item and examinee parameter estimates, and (5) to determine the joint effect of sample size, test length, and model on those estimates. This is a simulation study in which latent trait (θ) samples were drawn from a standard normal population, N(0, 1), for specific sample sizes (N) and test lengths (n) under the 1PL, 2PL, and 3PL models using WinGen. Item analysis was carried out using both the classical test theory and the modern test theory (IRT) approaches, and the data were analyzed in R with the ltm package. The results showed that the larger the sample size (N), the more stable the item parameter estimates, and the greater the test length (n), the more stable the examinee parameter (θ) estimates.

  • Research Article
  • Cited by 15
  • 10.1177/014662169201600405
The Effect of Test Length and IRT Model on the Distribution and Stability of Three Appropriateness Indexes
  • Dec 1, 1992
  • Applied Psychological Measurement
  • Brian W Noonan + 2 more

The extent to which three appropriateness indexes (Z3, ECIZ4, and W, a variation of Wright's person-fit statistic) are well-standardized was investigated in a Monte Carlo study. To assess the effects of the item response theory (IRT) model and test length on the distribution of the indexes and their cutoff values at three false-positive rates, nonaberrant response patterns were generated. ECIZ4 most closely approximated a normal distribution, showing less skewness and kurtosis than Z3 and W. The ECIZ4 cutoff values were also less affected by test length and the IRT model than were those of Z3 and W. In contrast, the distribution of W was the least stable over replications, and its cutoff values varied greatly depending on the IRT model and test length.
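
The abstract treats Z3 as a known quantity; for orientation, here is a minimal sketch of the standardized log-likelihood person-fit statistic lz, the index Z3 is based on, assuming a set of model-implied success probabilities at the examinee's ability estimate. The function name and numbers are illustrative only.

```python
import numpy as np

def lz_person_fit(u, p):
    """Standardized log-likelihood person-fit statistic (l_z); large
    negative values flag aberrant response patterns.

    u : observed 0/1 responses
    p : model-implied success probabilities at the ability estimate
    """
    u, p = np.asarray(u, float), np.asarray(p, float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))     # log-likelihood
    mean = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))   # its expectation
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)     # its variance
    return (l0 - mean) / np.sqrt(var)

# Hypothetical item probabilities and a model-consistent response pattern
p = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
u = np.array([1, 1, 1, 1, 0, 0])
print(lz_person_fit(u, p))   # near or above 0: no sign of aberrance
```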

  • Research Article
  • 10.11591/ijere.v8i3.19807
Parameter estimation bias of dichotomous logistic item response theory models using different variables
  • Sep 1, 2019
  • International Journal of Evaluation and Research in Education (IJERE)
  • Alper Köse + 1 more

The aim of this study was to examine the precision of item parameter estimation for different sample sizes and test lengths under the three-parameter logistic (3PL) item response theory (IRT) model when the trait measured by a test was not normally distributed or had a skewed distribution. In the study, the number of categories (1-0) and the item response model were fixed conditions, while sample size, test length, and ability distribution were manipulated. This is a simulation study, so data simulation and data analysis were done via packages in the R programming language. Results showed that item parameter estimates obtained under a normal distribution were much more accurate and bias-free compared to those obtained under a non-normal distribution. Moreover, sample size had a limited positive effect on parameter estimation, whereas test length had no effect on parameter estimation. As a result, the importance of the normality assumption for IRT models was highlighted, and the findings were discussed in light of the relevant literature.

  • Research Article
  • Cited by 1
  • 10.1080/00273171.2020.1753497
The Hellinger Distance within Posterior Predictive Assessment for Investigating Multidimensionality in IRT Models
  • Apr 20, 2020
  • Multivariate Behavioral Research
  • Mariagiulia Matteucci + 1 more

Under the Bayesian approach, posterior predictive model checking (PPMC) has become a popular tool for fit assessment of item response theory (IRT) models. In this study, we propose the use of the Hellinger distance within PPMC to quantify the distance between the realized and the predictive distribution of the model-based covariance for item pairs. Specifically, the case of multidimensional data analyzed with a unidimensional approach is taken into account. The results of the simulation study show the effectiveness of the method in detecting model misfit and the sensitivity to the trait correlations. An application to real data on tourism perceptions shows the feasibility of the method in practice and especially the capability of detecting potential misfit attributed to specific items.
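
Since the Hellinger distance itself is standard, a short sketch may help. The binned distributions below are invented; in the paper the distance is applied to the realized versus predictive distributions of the model-based covariance for item pairs.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions
    (0 = identical, 1 = disjoint support)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()   # normalize to proper distributions
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Hypothetical binned distributions of a discrepancy measure
realized  = [0.05, 0.20, 0.50, 0.20, 0.05]
predicted = [0.10, 0.25, 0.40, 0.20, 0.05]
print(round(hellinger(realized, predicted), 3))
```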

  • Research Article
  • 10.12691/education-7-11-19
Effectiveness of Mantel-Haenszel And Logistic Regression Statistics in Detecting Differential Item Functioning Under Different Conditions of Sample Size, Ability Distribution and Test Length
  • Nov 28, 2019
  • American Journal of Educational Research
  • Ferdinand Ingubu Ukanda + 3 more

Differential Item Functioning (DIF) is a statistical method that determines whether test measurements distinguish abilities by comparing the outcomes of two sub-populations on an item. The Mantel-Haenszel (MH) and Logistic Regression (LR) statistics provide effect size measures that quantify the magnitude of DIF. The purpose of the study was to investigate, through simulation, the effects of sample size, ability distribution, and test length on the number of DIF detections using the MH and LR methods. A factorial research design was used. The population of the study consisted of 2000 examinee responses. A stratified random sampling technique was used, with the reference (r) and focal (f) groups as the stratifying criteria. Small sample sizes (20r/20f and 60r/60f) and a large sample size (1000r/1000f) were established. WinGen3 statistical software was used to generate dichotomous item response data, and average effect sizes were obtained over 1000 replications. The numbers of DIF items detected were used to draw statistical graphs. The findings showed that the LR statistic detected more type A and B DIF items than MH, whereas the MH statistic detected more type C DIF items than LR, regardless of ability distribution, sample size, and test length. The number of type C DIF items detected depended on the sample size, test length, and ability distribution. Selective use of LR was therefore necessary for detecting type A and B DIF items, and of MH for detecting type C DIF items. The findings of the study are of great significance to teachers, educational policy makers, test developers, and test users.

  • Research Article
  • Cited by 36
  • 10.1177/0146621610390674
Fitting IRT Models to Dichotomous and Polytomous Data: Assessing the Relative Model–Data Fit of Ideal Point and Dominance Models
  • Feb 2, 2011
  • Applied Psychological Measurement
  • Louis Tay + 3 more

This study investigated the relative model-data fit of an ideal point item response theory (IRT) model (the generalized graded unfolding model [GGUM]) and dominance IRT models (e.g., the two-parameter logistic model [2PLM] and Samejima's graded response model [GRM]) to simulated dichotomous and polytomous data generated from each of these models. The relative magnitudes of the adjusted χ²/df ratios for item pairs and item triples at the test level were used to evaluate fit. Two simulation studies were conducted, one for dichotomous data and the other for polytomous data. The relative fit of the ideal point and dominance models was compared across different conditions: test length, sample size, and sample type. In many simulated conditions, comparing relative fit (using test-level doubles and triples adjusted χ²/df ratios) almost always pointed to the correct IRT model. However, the GGUM could fit dichotomous two-parameter logistic (2PL) data well when the scale was short (i.e., 15 items); nevertheless, an examination of the estimated GGUM item parameters clearly showed dominance item characteristics. Results of the simulation studies and implications are discussed.

  • Research Article
  • Cited by 38
  • 10.1177/0146621617707510
Inferential Item-Fit Evaluation in Cognitive Diagnosis Modeling.
  • May 19, 2017
  • Applied Psychological Measurement
  • Miguel A Sorrel + 4 more

Research related to fit evaluation at the item level involving cognitive diagnosis models (CDMs) has been scarce. According to the parsimony principle, balancing goodness of fit against model complexity is necessary. General CDMs require a larger sample size to be estimated reliably, and can lead to worse attribute classification accuracy than the appropriate reduced models when the sample size is small and the item quality is poor, which is typically the case in many empirical applications. The main purpose of this study was to systematically examine the statistical properties of four inferential item-fit statistics: S−X², the likelihood ratio (LR) test, the Wald (W) test, and the Lagrange multiplier (LM) test. To evaluate the performance of the statistics, a comprehensive set of factors, namely, sample size, correlational structure, test length, item quality, and generating model, was systematically manipulated using Monte Carlo methods. Results show that the S−X² statistic has unacceptable power. Type I error and power comparisons favor the LR and W tests over the LM test. However, all the statistics are highly affected by item quality. With a few exceptions, their performance is acceptable only when the item quality is high. In some cases, this effect can be ameliorated by an increase in sample size and test length. This implies that using the above statistics to assess item fit in practical settings with low item quality remains a challenge.

  • Research Article
  • Cited by 5
  • 10.1080/08957347.2013.739419
An Empirical Investigation of Methods for Assessing Item Fit for Mixed Format Tests
  • Jan 1, 2013
  • Applied Measurement in Education
  • Kyong Hee Chon + 2 more

Empirical information regarding the performance of model-fit procedures has been a persistent need in measurement practice. Statistical procedures for evaluating item fit were applied to real test examples consisting of both dichotomously and polytomously scored items. The item fit statistics used in this study included PARSCALE's G², Orlando and Thissen's (2000) S−X² and S−G², and Stone's (2000) χ²* and G²*. The results of this study indicated that the fit of an individual item was affected by the choice of model-fit analysis. The performance of the fit indices appeared to vary depending on the item response theory (IRT) model mixtures used for calibration, sample size, and test length. In terms of consistency among the fit indices, statistics based on the same approach (e.g., S−X² and S−G²) showed considerably higher agreement in detecting misfitting items than statistics based on different approaches (e.g., G² and S−X²). Consistent and inconsistent findings compared to previous research are discussed along with practical implications.

  • Research Article
  • Cited by 30
  • 10.3102/1076998619890566
A Bias-Corrected RMSD Item Fit Statistic: An Evaluation and Comparison to Alternatives
  • Dec 19, 2019
  • Journal of Educational and Behavioral Statistics
  • Carmen Köhler + 2 more

Testing whether items fit the assumptions of an item response theory model is an important step in evaluating a test. In the literature, numerous item fit statistics exist, many of which show severe limitations. The current study investigates the root mean squared deviation (RMSD) item fit statistic, which is used for evaluating item fit in various large-scale assessment studies. The three research questions of this study are (1) whether the empirical RMSD is an unbiased estimator of the population RMSD; (2) if this is not the case, whether this bias can be corrected; and (3) whether the test statistic provides an adequate significance test to detect misfitting items. Using simulation studies, it was found that the empirical RMSD is not an unbiased estimator of the population RMSD, and nonparametric bootstrapping falls short of entirely eliminating this bias. Using parametric bootstrapping, however, the RMSD can be used as a test statistic that outperforms the other approaches (infit and outfit, S−X²) with respect to both Type I error rate and power. The empirical application showed that parametric bootstrapping of the RMSD results in rather conservative item fit decisions, which suggests more lenient cutoff criteria.
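
A rough sketch of the parametric bootstrap logic described here, under simplifying assumptions: the item's RMSD is recomputed on datasets simulated from the fitted model, and the p-value is the share of bootstrap RMSDs at least as large as the observed one. To stay short, the sketch bins examinees by ability instead of using E-step pseudocounts and skips refitting the model to each replicated dataset, which the paper's procedure requires; all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

def irf_2pl(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

def rmsd_binned(y, theta, a, b, edges):
    """Crude RMSD: observed vs model-implied proportions within ability bins."""
    p_model = irf_2pl(theta, a, b)
    idx = np.digitize(theta, edges)
    devs, ws = [], []
    for g in np.unique(idx):
        m = idx == g
        devs.append(y[m].mean() - p_model[m].mean())
        ws.append(m.mean())
    return np.sqrt(np.sum(np.array(ws) * np.array(devs) ** 2))

a_hat, b_hat, N, B = 1.2, 0.1, 1000, 300
edges = np.linspace(-2, 2, 9)
theta = rng.standard_normal(N)
y_obs = (rng.random(N) < irf_2pl(theta, a_hat, b_hat)).astype(int)
obs = rmsd_binned(y_obs, theta, a_hat, b_hat, edges)

# Parametric bootstrap: the RMSD's null distribution under the fitted model
boot = np.empty(B)
for r in range(B):
    th = rng.standard_normal(N)
    y = (rng.random(N) < irf_2pl(th, a_hat, b_hat)).astype(int)
    boot[r] = rmsd_binned(y, th, a_hat, b_hat, edges)

print("p-value:", np.mean(boot >= obs))   # large -> no evidence of misfit
```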

  • Research Article
  • Cited by 118
  • 10.1111/j.1745-3984.2003.tb01150.x
Assessing Goodness of Fit of Item Response Theory Models: A Comparison of Traditional and Alternative Procedures
  • Dec 1, 2003
  • Journal of Educational Measurement
  • Clement A Stone + 1 more

Testing the goodness of fit of item response theory (IRT) models is relevant to validating IRT models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type I error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). Different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type I error rates.
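
The score-conditional comparison these alternatives perform (the Orlando-Thissen approach) rests on the Lord-Wingersky recursion for the model-implied summed-score distribution. A minimal sketch under assumed 2PL inputs follows; the function names and parameter values are illustrative, not from the paper.

```python
import numpy as np

def lord_wingersky(P):
    """Summed-score distribution via the Lord-Wingersky recursion.
    P: (J, Q) success probabilities per item at each quadrature node.
    Returns (J+1, Q): probability of each summed score at each node."""
    J, Q = P.shape
    S = np.zeros((J + 1, Q))
    S[0] = 1.0
    for j in range(J):
        new = np.zeros((J + 1, Q))
        new[: j + 1] += S[: j + 1] * (1 - P[j])   # item j answered wrong
        new[1 : j + 2] += S[: j + 1] * P[j]       # item j answered right
        S = new
    return S

def expected_prop_correct(P, weights, i):
    """Model-implied proportion correct on item i conditional on each
    summed score k, to be compared with observed proportions."""
    J, _ = P.shape
    S_all = lord_wingersky(P)                          # all J items
    S_rest = lord_wingersky(np.delete(P, i, axis=0))   # the other J-1 items
    denom = S_all @ weights                            # P(score = k)
    num = (P[i] * S_rest) @ weights                    # item i correct, rest = k-1
    E = np.zeros(J + 1)
    E[1:] = num[:J] / denom[1:]                        # score 0 forces item i wrong
    return E

# Hypothetical 2PL test: 4 items, 41 quadrature nodes, N(0, 1) weights
theta = np.linspace(-4, 4, 41)
w = np.exp(-0.5 * theta ** 2); w /= w.sum()
a = np.array([1.0, 1.2, 0.8, 1.5]); b = np.array([-1.0, 0.0, 0.5, 1.0])
P = 1 / (1 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
print(expected_prop_correct(P, w, i=0))
```

Observed proportions correct at each summed score would then be compared against these expectations to form the item fit statistic.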
