Common Persons Design in Score Equating: A Monte Carlo Investigation.

Abstract

The Common Persons (CP) equating design offers critical advantages for high-security testing contexts, eliminating anchor-item exposure risks while accommodating non-equivalent groups, yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (N = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating eight factors spanning test characteristics (test length, difficulty shift, ability dispersion, correlation between test forms) and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE (NRMSE) and percentage bias (%Bias). Key findings reveal that: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) test factors dominate outcomes: difficulty shifts of 1 degrade IRT precision severely (|%Bias| > 22%, versus |%Bias| < 1.5% for the linear and equipercentile methods), while longer tests reduce NRMSE and a wider ability dispersion of 1 enhances precision through improved person-item targeting; and (c) the equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: at least 30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving the security-versus-accuracy tradeoff in high-stakes equating (e.g., credentialing exams) and enabling novel solutions such as synthetic respondents.
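As a concrete illustration of the evaluation pipeline described above, the linear equating transformation and the two accuracy indices can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's code: the function names and toy score vectors are invented, and the %Bias convention used here (normalizing by the criterion mean) is one common choice that may differ from the paper's exact definition.

```python
import statistics as st

def linear_equate(x_scores, y_scores):
    """Linear equating: map Form X scores onto the Form Y scale via
    e_Y(x) = mu_Y + (sd_Y / sd_X) * (x - mu_X)."""
    mu_x, sd_x = st.mean(x_scores), st.pstdev(x_scores)
    mu_y, sd_y = st.mean(y_scores), st.pstdev(y_scores)
    return lambda x: mu_y + (sd_y / sd_x) * (x - mu_x)

def nrmse(estimates, criterion, score_range):
    """Root mean squared error, normalized by the score range."""
    mse = st.mean([(e - c) ** 2 for e, c in zip(estimates, criterion)])
    return (mse ** 0.5) / score_range

def pct_bias(estimates, criterion):
    """Mean signed error as a percentage of the criterion mean."""
    bias = st.mean([e - c for e, c in zip(estimates, criterion)])
    return 100.0 * bias / st.mean(criterion)

# Toy illustration: Form X is uniformly 2 points harder than Form Y,
# so the linear transformation recovers the criterion exactly.
x = [10, 12, 14, 16, 18]
y = [12, 14, 16, 18, 20]
eq = linear_equate(x, y)
equated = [eq(s) for s in x]
```

In the toy case the equated scores land exactly on the criterion, so both indices are zero; in the simulation they summarize how far each method's equated scores fall from the true equating function.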

Similar Papers
  • Research Article
  • 10.35516/edu.v51i3.6755
The Effectiveness of Mantel Haenszel Log Odds Ratio Method in Detecting Differential Item Functioning Across Different Sample Sizes and Test Lengths Using Real Data Analysis
  • Sep 15, 2024
  • Dirasat: Educational Sciences
  • Reem Mohammad Elyan + 1 more

Objectives: This study aims to determine the effectiveness of the Mantel-Haenszel Log Odds Ratio method in detecting Differential Item Functioning (DIF) across gender, while considering variations in sample size and test length. Utilizing real data, the study draws from a sample of tenth-grade students in Jordan who participated in the 2018 PISA International Mathematics Test. Methods: The study employs an experimental methodology, utilizing three levels of sample size (342, 200, and 100) and three levels of test length (30, 20, and 10). Nine iterations of the DDFS program were conducted to collect the results, representing the nine scenarios resulting from crossing the sample-size and test-length levels. Results: The study indicates that variations in sample size and test length significantly affect the Mantel-Haenszel (MH) method. Specifically, it observes an improvement in the MH method's ability to detect DIF items with larger sample sizes, while maintaining a consistent test length. Conversely, the method's efficacy declines with longer tests, despite maintaining a fixed sample size. Conclusion: The study recommends using a large sample size and a short test length for effective detection of DIF items using the MH method.
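The MH log odds ratio this study applies has a compact closed form over the 2x2 tables formed at each matched score level. A minimal sketch, assuming dichotomous items (the table layout, function name, and toy counts are illustrative, not the study's code):

```python
import math

def mh_stats(tables):
    """Mantel-Haenszel DIF statistics across K matched score strata.
    Each table is (A, B, C, D): reference correct, reference incorrect,
    focal correct, focal incorrect. Returns the MH log odds ratio and
    the ETS delta metric, delta_MH = -2.35 * ln(alpha_MH)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    log_odds = math.log(num / den)
    return log_odds, -2.35 * log_odds

# Toy case: identical correct/incorrect odds in both groups at every
# stratum, i.e., no DIF, so the log odds ratio and delta are both 0.
log_odds, delta = mh_stats([(20, 10, 40, 20), (30, 10, 60, 20)])
```

A log odds ratio meaningfully different from zero (or |delta| beyond the conventional ETS cutoffs) flags the item for DIF; the study's question is how reliably that happens at small sample sizes and different test lengths.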

  • Research Article
  • Cited by 7
  • 10.12738/estp.2014.6.2165
Comparing Performances (Type I error and Power) of IRT Likelihood Ratio SIBTEST and Mantel-Haenszel Methods in the Determination of Differential Item Functioning
  • Jan 1, 2014
  • Educational Sciences: Theory & Practice
  • Kübra Atalay Kabasakal + 2 more

This simulation study compared the performances (Type I error and power) of the Mantel-Haenszel (MH), SIBTEST, and item response theory likelihood ratio (IRT-LR) methods under certain conditions. Manipulated factors were sample size, ability differences between groups, test length, the percentage of differential item functioning (DIF), and the underlying model used to generate data. Results suggest that SIBTEST had the highest Type I error in the detection of uniform DIF, but MH had the highest power under all conditions. In addition, the percentage of DIF and the underlying model appear to have influenced the Type I error rate of IRT-LR. Ability differences between groups, test length, the percentage of DIF, the model, and the interactions ability differences × percentage of DIF, ability differences × test length, test length × percentage of DIF, and test length × model affected the SIBTEST method's Type I error rate. In the MH procedure, the effective factors for the Type I error rate were sample size, test length, the percentage of DIF, ability differences × percentage of DIF, ability differences × model, and ability differences × percentage of DIF × model. No factors were effective on the power of SIBTEST and MH, but the underlying model had a significant effect on the IRT-LR power rate.

  • Research Article
  • 10.12691/education-7-11-19
Effectiveness of Mantel-Haenszel And Logistic Regression Statistics in Detecting Differential Item Functioning Under Different Conditions of Sample Size, Ability Distribution and Test Length
  • Nov 28, 2019
  • American Journal of Educational Research
  • Ferdinand Ingubu Ukanda + 3 more

Differential Item Functioning (DIF) analysis is a statistical method that determines whether test measurements distinguish abilities by comparing the outcomes of two sub-populations on an item. The Mantel-Haenszel (MH) and Logistic Regression (LR) statistics provide effect size measures that quantify the magnitude of DIF. The purpose of the study was to investigate through simulation the effects of sample size, ability distribution, and test length on the number of DIF detections using the MH and LR methods. A factorial research design was used in the study. The population of the study consisted of 2,000 examinee responses. A stratified random sampling technique was used, with the reference (r) and focal (f) groups as the stratifying criteria. Small sample sizes (20r/20f and 60r/60f) and a large sample size (1000r/1000f) were established. WinGen3 statistical software was used to generate dichotomous item response data. The average effect sizes were obtained over 1,000 replications, and the numbers of DIF items detected were used to draw statistical graphs. The findings of the study showed that the MH statistic detected more type A and B DIF items than LR regardless of the nature of the ability distribution, sample size, and test length; the MH statistic also detected more type C DIF items than LR under all conditions. The number of type C DIF items detected depended on the sample size, test length, and ability distribution. Selective use of LR was therefore deemed necessary for detecting type A and B DIF items, and MH for detecting type C DIF items. The findings of the study are of great significance to teachers, educational policy makers, test developers, and test users.

  • Research Article
  • Cited by 35
  • 10.12738/estp.2017.1.0270
The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
  • Jan 1, 2017
  • Educational Sciences: Theory & Practice
  • Alper Şahin + 1 more

This study investigates the effects of sample size and test length on item-parameter estimation in test development utilizing three unidimensional dichotomous models of item response theory (IRT). For this purpose, a real language test comprising 50 items was administered to 6,288 students. Data from this test were used to obtain data sets of three test lengths (10, 20, and 30 items) and nine sample sizes (150, 250, 350, 500, 750, 1,000, 2,000, 3,000, and 5,000 examinees). These data sets were then used to create various research conditions in which test length, sample size, and IRT model were manipulated to investigate item-parameter estimation accuracy. The results suggest that rather than sample size or test length alone, the combination of the two is what matters, and that samples of 150, 250, 350, 500, and 750 examinees can be used to estimate item parameters accurately in the three unidimensional dichotomous IRT models, depending on the test length and model employed.
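The three unidimensional dichotomous IRT models compared in studies like this one are nested versions of the three-parameter logistic (3PL) response function; a small sketch makes the nesting explicit (the function name and toy parameter values are illustrative):

```python
import math

def irt_prob(theta, a=1.0, b=0.0, c=0.0):
    """3PL probability of a correct response:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))).
    With c = 0 this reduces to the 2PL, and with c = 0, a = 1
    to the 1PL (Rasch) model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A Rasch item answered by an examinee at the item's own difficulty:
p1 = irt_prob(0.0)                        # 0.5 by construction
# A 3PL item with guessing: 0.2 + 0.8 * 0.5 = 0.6 at theta = b.
p3 = irt_prob(0.0, a=1.5, b=0.0, c=0.2)
```

Estimation accuracy in such studies is about how well a (discrimination), b (difficulty), and c (lower asymptote) are recovered from response data; the more parameters a model carries, the more examinees it tends to need.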

  • Research Article
  • Cited by 3
  • 10.21031/epod.02072
Madde Tepki Kuramına ait Parametrelerin ve Model Uyumlarının Karşılaştırılması: Bir Monte Carlo Çalışması [Comparison of Item Response Theory Parameters and Model Fit: A Monte Carlo Study]
  • Dec 26, 2015
  • Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi
  • Hakan Koğar

The purpose of this study is to identify and compare NIRT, PIRT, and MIRT models across different sample sizes, test lengths, and correlations between dimensions in two-dimensional simple structures. Data sets were simulated under various conditions: sample size (100, 500, 1000, and 5000), test length (5, 15, and 25), and correlation between dimensions (0.00, 0.25, and 0.50). Within the frame of this Monte Carlo study, the findings for each experimental design were obtained through 20 replications. Standard errors and significance values were calculated for the item parameters and the item-level model-data fit. The analyses show that model-data fit for the test increased as sample size and test length increased. Tests consisting of fewer items fit the MIRT models better. In all simulation designs, item-level model-data fit was estimated with quite low errors under NIRT. The chi-square, infit, and outfit values obtained for PIRT all exhibited better model fit as sample size and test length increased. In NIRT, the standard errors of the Hi and p parameters tended to decrease with increasing sample size and test length; in PIRT, the standard errors of the a parameters likewise tended to decrease as sample size and test length increased.

  • Conference Article
  • Cited by 3
  • 10.15439/2020f197
Feasibility of computerized adaptive testing evaluated by Monte-Carlo and post-hoc simulations
  • Sep 26, 2020
  • Lubomír Štěpánek + 1 more

Computerized adaptive testing (CAT) is a modern alternative to classical paper-and-pencil testing. CAT is based on automated selection of the optimal item corresponding to the current estimate of the test-taker's ability, in contrast to the fixed, predefined items of a linear test. Advantages of CAT include lowered test anxiety, shortened test length, increased precision of ability estimates, and a lower level of item exposure, and thus better security. Challenges are the high technical demands on the whole testing workflow and the need for large item banks. In this study, we analyze the feasibility and advantages of computerized adaptive testing using a Monte Carlo simulation and a post-hoc analysis based on a real linear admission test administered at a medical college. We compare various settings of the adaptive test in terms of precision of ability estimates and test length. We find that with adaptive item selection, the test length can be reduced to 40 out of 100 items while keeping the precision of ability estimates within the prescribed range and obtaining ability estimates highly correlated with estimates based on the complete linear test (Pearson's ρ = 0.96). We also demonstrate a positive effect of content balancing and item exposure rate control on item composition.
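The adaptive item selection at the heart of such a CAT can be sketched compactly, assuming a 2PL item bank and the standard maximum-information criterion (the bank, function names, and toy values below are illustrative assumptions, not the authors' implementation):

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """2PL item information: I(theta) = a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def select_item(theta, bank, used):
    """Maximum-information selection: among unadministered items,
    pick the one most informative at the current ability estimate."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: fisher_info(theta, *bank[i]))

bank = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]  # (a, b) pairs
best = select_item(0.0, bank, used=set())      # b = 0 matches theta = 0
```

Each administered response updates the ability estimate, and the loop repeats until a precision target or maximum length is reached, which is why the adaptive test can stop well before 100 items.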

  • Research Article
  • Cited by 1
  • 10.21449/ijate.1290831
Comparison of item response theory ability and item parameters according to classical and Bayesian estimation methods
  • Jun 20, 2024
  • International Journal of Assessment Tools in Education
  • Eray Selçuk + 1 more

This research aims to compare the ability and item parameter estimates of item response theory obtained under maximum likelihood and Bayesian approaches across different Monte Carlo simulation conditions. For this purpose, the ability and item parameters estimated by each method, and the differences in their RMSE, were examined as a function of prior distribution type (normal, left-skewed, right-skewed, leptokurtic, and platykurtic), test length (10, 20, 40), sample size (100, 500, 1000), and logistic model (2PL, 3PL). The simulation conditions were run with 100 replications, and mixed-model ANOVA was performed to test for RMSE differences. Prior distribution type, test length, and estimation method produced significant differences in the RMSE of the ability parameter in the 2PL model; prior distribution type and test length were significant for the ability parameter in the 3PL model. While prior distribution type, sample size, and estimation method created a significant difference in the RMSE of the item discrimination parameter estimated in the 2PL model, none of the conditions created a significant difference in the RMSE of the item difficulty parameter. In the 3PL model, prior distribution type, sample size, and estimation method were significant for the item discrimination RMSE, and the prior distribution and estimation method created significant differences in the RMSE of the lower asymptote parameter. However, none of the conditions significantly changed the RMSE of the item difficulty parameters.

  • Research Article
  • Cited by 1
  • 10.21031/epod.385000
The Examination of Item Difficulty Distribution, Test Length and Sample Size in Different Ability Distribution
  • Sep 29, 2018
  • Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi
  • Melek Gülşah Şahin + 1 more

This is a post-hoc simulation study which investigates the effect of different item difficulty distributions, sample sizes, and test lengths on measurement precision while estimating the examinee parameters in right- and left-skewed distributions. First, the examinee parameters were obtained from 20-item real test results for right-skewed and left-skewed sample groups of 500, 1000, 2500, 5000, and 10000. In the second phase of the study, four different tests were formed according to the b parameter values: normal, uniform, left-skewed, and right-skewed distributions. A total of 80 conditions were formed within the scope of this research by adding 20-item and 30-item conditions as the test length variable. In determining the measurement precision, the RMSE and AAD values were calculated, and the results were evaluated in terms of the item difficulty distributions, sample sizes, and test lengths. As a result, in the right-skewed examinee distribution, the highest measurement precision was obtained with the normal b distribution and the lowest with the right-skewed b distribution. Higher measurement precision was obtained in the 30-item test; however, the change in sample size did not affect measurement precision significantly in the right-skewed examinee distribution. In the left-skewed distribution, the highest measurement precision was obtained with the normal b distribution and the lowest with the left-skewed b distribution. It was also observed that changes in sample size and test length did not affect measurement precision significantly in the left-skewed distribution.

  • Research Article
  • 10.1177/00131644251369532
Evaluation of Item Fit With Output From the EM Algorithm: RMSD Index Based on Posterior Expectations.
  • Oct 4, 2025
  • Educational and psychological measurement
  • Yun-Kyung Kim + 2 more

In item response theory modeling, item fit analysis using posterior expectations, otherwise known as pseudocounts, has many advantages. These quantities are readily obtained from the E-step output of the Bock-Aitkin Expectation-Maximization (EM) algorithm and continue to function as a basis for evaluating model fit even when missing data are present. This paper aimed to improve the interpretability of the root mean squared deviation (RMSD) index based on posterior expectations. In Study 1, we assessed its performance using two approaches. First, we employed the poor person's posterior predictive model checking (PP-PPMC) to compute significance levels. The resulting Type I error was generally controlled below the nominal level, but power noticeably declined with smaller sample sizes and shorter test lengths. Second, we used receiver operating characteristic (ROC) curve analysis to empirically determine the reference values (cutoff thresholds) that achieve an optimal balance between false-positive and true-positive rates. Importantly, we identified optimal reference values for each combination of sample size and test length in the simulation conditions. The cutoff threshold approach outperformed the PP-PPMC approach, with greater gains in true-positive rates than losses from the inflated false-positive rates. In Study 2, we extended the cutoff threshold approach to conditions with larger sample sizes and longer test lengths. Moreover, we evaluated the performance of the optimized cutoff thresholds under varying levels of data missingness. Finally, we employed response surface analysis to develop a prediction model that generalizes how the reference values vary with sample size and test length. Overall, this study demonstrates the application of the PP-PPMC for item fit diagnostics and implements a practical frequentist approach to empirically derive reference values. Using our prediction model, practitioners can compute reference values of RMSD tailored to their dataset's sample size and test length.
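At its core, the RMSD index this paper builds on is a weighted root mean squared deviation between observed (pseudocount-based) and model-implied proportions correct across quadrature points. A minimal sketch with illustrative inputs; the paper's exact weighting and estimation details may differ:

```python
def rmsd_item_fit(obs, exp, weights):
    """RMSD item fit: weighted root mean squared deviation between
    observed (pseudocount-based) and model-implied proportions correct
    at each quadrature point, with weights proportional to the
    posterior mass at that point."""
    total = sum(weights)
    return sum(w / total * (o - e) ** 2
               for o, e, w in zip(obs, exp, weights)) ** 0.5

# Perfect fit: observed proportions equal the model-implied ones,
# so the index is exactly 0 regardless of the weights.
fit = rmsd_item_fit([0.3, 0.5, 0.7], [0.3, 0.5, 0.7], [1, 2, 1])
```

The paper's contribution is not the index itself but calibrated cutoff values for it: how large an RMSD must be, at a given sample size and test length, before the item should be flagged as misfitting.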

  • Research Article
  • 10.17275/per.22.108.9.5
Effect of Item Parameter Drift in Mixed Format Common Items on Test Equating
  • Sep 1, 2022
  • Participatory Educational Research
  • İbrahim Uysal + 2 more

The aim of the study was to examine test equating under the common-item non-equivalent groups design when the common items in a mixed format (e.g., multiple-choice and essay items) contain parameter drift. In this study, which was carried out using a Monte Carlo simulation with a fully crossed design, the manipulated factors were test length (30 and 50), sample size (1000 and 3000), common item ratio (30 and 40%), ratio of items with item parameter drift (IPD) among the common items (20 and 30%), location of the common items in the tests (at the beginning, randomly distributed, and at the end), and IPD size in the multiple-choice items (low [0.2] and high [1.0]). Four test forms were created; two of them contain no parameter drift, and parameter drift was introduced into each of the other two forms. Test equating results were compared using the root mean squared error (RMSE). As a result of the research, the effects of the ratio of items with IPD among the common items, the IPD size in multiple-choice items, the common item ratio, the sample size, and the test length on equating errors were all found to be significant.

  • Research Article
  • Cited by 1
  • 10.3724/sp.j.1041.2012.00400
Dynamic and Comprehensive Item Selection Strategies for Computerized Adaptive Testing Based on Graded Response Model
  • Apr 12, 2013
  • Acta Psychologica Sinica
  • Fen Luo + 2 more

Item selection strategy (ISS) is a core component of computerized adaptive testing (CAT). Polytomous items provide more information about an examinee than dichotomous items, and adopting polytomously scored items in tests is one research direction for CAT. The most widely used ISS is the maximum Fisher information (MFI) criterion, which raises concerns about cost-efficiency of pool utilization and poses security risks for CAT programs. Chang and Ying (1999) and Chang, Qian, and Ying (2001) proposed two alternative item selection procedures based on dichotomous models, the a-stratified method (a-STR) and the a-stratified with b-blocking method (b-STR), with the goal of remedying the item overexposure and underexposure produced by MFI. However, a-STR and b-STR are static techniques, because the items are stratified according to the information available at the beginning of the test. Based on the graded response model (GRM), a technique that reduces the dimensionality of the difficulty (step) parameters was recently employed to construct some ISSs; the limitation of this dimension-reduction technique is that it loses a lot of information. Thus, in order to improve on MFI, two new item selection methods are proposed based on the GRM: (1) modify the dimensionality-reduction technique for the difficulty (step) parameters by integrating interval estimation; (2) implement dynamic a-STR and dynamic b-STR methods during the testing process.
On one hand, these new ISSs avoid the limitations of MFI while making good use of the advantages of the Fisher information function (FIF); the FIF combines all item parameters and ability parameters, so it is by nature a comprehensive tool for all parameters. On the other hand, the new ISSs exploit the property that the FIF represents the inverse of the variance of the ability estimate: let e be the square root of the reciprocal of the Fisher information and d be the absolute deviation between the estimated ability and a function of the parameters of a candidate item, which may change during the course of the CAT. The inequality d ≤ e then has the form of an interval estimate, and its effect can be pictured as a more flexible shadow item pool. A simulation study based on the GRM was conducted. Four item pools of different structures were simulated, and 1,000 examinees were generated with abilities randomly drawn from the standard normal distribution N(0,1). Each pool consists of 1,000 polytomous items, and the maximum score of each item was randomly selected from the set {3, 4, 5, 6}. In this paper, we assume the prior distribution of ability is standard normal, and the Bayesian expected a posteriori (EAP) estimator is employed to estimate the ability parameter. The CAT stopped when the accumulated information reached a pre-determined value M (M = 16) or the test reached the pre-assigned length of 30 items. The results of the simulation study show that the new item selection methods required shorter test lengths and lower average exposure rates than the other methods, while maintaining the accuracy of ability estimation. More specifically, the new ISSs that applied the idea of interval estimation were better than the corresponding ISSs in terms of the chi-square value, and the same effect appeared when comparing the dynamic a-STR and dynamic b-STR ISSs with MFI. Some important results were also found by comparing different item pool structures.
The accuracy of ability estimation and the item exposure rate were related to the distribution of the difficulty parameter b: the accuracy of ability estimation obtained when b was sampled from N(0,1) was better than when b was sampled from a uniform distribution, while the opposite held for the item exposure rate. The test length was related to the distribution of the discrimination parameter a: the test length required when a was sampled from a uniform distribution was shorter than when the logarithm of a was sampled from N(0,1). In short, in terms of controlling and balancing item exposure, the new ISSs may gain an advantage over the former corresponding ISSs.

  • Single Report
  • Cited by 10
  • 10.21236/ada105509
Methods for Linking Item Parameters.
  • Aug 1, 1981
  • C David Vale + 4 more

A simulation study was designed to determine appropriate linking methods for adaptive testing items. Responses of examinees in three group sizes for four test lengths were simulated. Three basic data sets were created: (a) a randomly sampled data set, (b) a systematically sampled data set, and (c) a selected data set. The evaluative criteria included fidelity of parameter estimation, asymptotic ability estimates, root-mean-square error of estimates, and the correlation between true and estimated ability. Test length appeared to be relatively more important to calibration effectiveness than was sample size; efficiency analyses suggested that increases in test length were at least three to four times as effective in improving calibration efficiency as proportionate increases in calibration sample sizes. The asymptotic ability analyses suggested that the linking procedures based on Bayesian ability estimation (an equivalent-groups procedure) were somewhat more effective than the others, and that the equivalent-tests method was typically no better than not linking at all. Analyses using the relative efficiency criteria suggested that the equivalent-groups procedures were superior to the equivalent-tests procedures, and that those using Bayesian scoring procedures were slightly superior to the others tested. Efficiency loss due to linking error was always less than that due to item calibration error, and although test length and sample size had a definite effect on calibration efficiency, no strong effects appeared with respect to linking efficiency. For the systematically sampled data set, the anchor-test and anchor-group methods were considered along with the equivalence methods.

  • Research Article
  • 10.15408/jp3i.v13i1.27975
The Impact of Sample Size, Test Length, and Person-Item Targeting on the Separation Reliability in Rasch Model: A Simulation Study
  • May 30, 2024
  • JP3I (Jurnal Pengukuran Psikologi dan Pendidikan Indonesia)
  • Rahmat S Bintang + 1 more

This research is a simulation study using resampling methods to examine the effect of sample size, test length, and person-item targeting on separation reliability in the Rasch model. Simulation conditions were created from several predetermined factors, namely sample size with five conditions (200, 500, 1000, 2000, and 4000 persons), test length with three conditions (20, 40, and 60 items), and person-item targeting with five conditions (-2, -1, 0, 1, and 2 logits). Each of the 75 resulting conditions was replicated 50 times, so that a total of 3,750 data sets were generated. The data were generated using WinGen software, and separation reliability was analyzed using Winsteps software. The criteria set were Person Separation Reliability (PSR) > 0.80 and Item Separation Reliability (ISR) > 0.90. The results showed that all 75 conditions (100%) produced ISR estimates that met the criterion (> 0.90). For PSR, 37 conditions (49%) produced estimates that met the criterion (> 0.80) and 38 conditions (51%) did not (< 0.80). In addition, PSR estimation is influenced by test length and person-item targeting.
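Separation reliability in the Rasch model has a simple textbook definition that the factors above act through: the proportion of observed variance in the estimated measures that is not attributable to measurement error. A minimal sketch under that definition (the function name and toy values are illustrative, not the study's code):

```python
import statistics as st

def separation_reliability(measures, std_errors):
    """Rasch separation reliability: (observed variance of the
    estimated measures minus the mean squared standard error)
    divided by the observed variance."""
    obs_var = st.pvariance(measures)
    err_var = st.mean(se ** 2 for se in std_errors)
    return (obs_var - err_var) / obs_var

# Well-targeted toy case: measures spread over 4 logits with small
# standard errors yield a reliability near 1.
psr = separation_reliability([-2.0, -1.0, 0.0, 1.0, 2.0], [0.2] * 5)
```

This makes the study's pattern intuitive: longer tests shrink the standard errors (raising reliability), while poor person-item targeting inflates them, which is why PSR failed the 0.80 criterion in roughly half the conditions.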

  • Research Article
  • Cited by 16
  • 10.1007/s10953-019-00905-y
Activity Coefficients of Concentrated Salt Solutions: A Monte Carlo Investigation
  • Aug 19, 2019
  • Journal of Solution Chemistry
  • Zareen Abbas + 1 more

Monte Carlo (MC) simulations were used to calculate single ion and mean ionic activity coefficients and water activity in concentrated electrolytes and at elevated temperatures. By using a concentration dependent dielectric constant, the applicability range of the MC method was extended to 3 mol·L−1 or beyond, depending on the salt. The calculated activity coefficients were fitted to experimental data by adjusting only one parameter, i.e., the cation radius. Fitted ionic radii obtained by such a procedure indicate the extent of cation–anion interaction in a salt solution. For example, the fitted radii of Li+ and Na+ in LiClO3 and NaClO3 indicate that Li+ is strongly hydrated and has a weak interaction with the ClO3− ion whereas Na+ forms ion pairs and loses its hydration. The single ion activity coefficients for protons and chloride ions in HCl were calculated by MC simulations and compared with experimental values obtained by ion selective electrodes. The calculated single ion activity coefficients for protons and chloride ions are much lower and higher, respectively, than the experimental values. However, the mean activity coefficients of HCl obtained by the MC simulations, ion selective electrodes and vapor pressure measurements are in good agreement. In the case of NaCl and KCl the calculated single ion activity coefficients of Na+, K+, and Cl− are much closer to the values obtained by ion selective electrodes. The results in HCl indicate that the hydrated proton is large and includes the chloride ion within the hydration shell, i.e., the apparent size of the chloride ion is negligible.

  • Research Article
  • Cited by 9
  • 10.1016/j.jmir.2011.09.002
Evaluation of the AAA Treatment Planning Algorithm for SBRT Lung Treatment: Comparison with Monte Carlo and Homogeneous Pencil Beam Dose Calculations
  • Nov 30, 2011
  • Journal of Medical Imaging and Radiation Sciences
  • Ermias Gete + 2 more

