- Research Article
- 10.1177/01466216261440511
- Mar 30, 2026
- Applied psychological measurement
- Jonas Moss
When only summary statistics from published studies are available, the Hunter-Schmidt interval is the standard tool for inference on Spearman's disattenuated correlation, but it treats reliability estimates as known constants and ignores their sampling variability. We derive a simple delta method variance that accounts for the uncertainty of all estimates while requiring nothing beyond the summaries already at hand. Under bivariate normality of scores and coefficient alpha from a normal parallel model, the corrected interval is asymptotically valid. In simulations it achieves coverage near nominal, while Hunter-Schmidt can undercover substantially when reliability is imprecisely estimated.
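The baseline this abstract criticizes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the interval is one common form of the Hunter-Schmidt approach (a large-sample interval for the observed correlation whose endpoints are divided by the attenuation factor), and the article's delta method variance is not reproduced here because the abstract does not give it.

```python
import math


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Spearman's correction: observed correlation over sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)


def hunter_schmidt_interval(r_xy, rel_x, rel_y, n, z=1.96):
    """Interval that treats both reliabilities as known constants.

    Assumption: the standard error of the observed correlation is taken as
    (1 - r^2) / sqrt(n - 1), and only that uncertainty is propagated.
    """
    se_r = (1.0 - r_xy ** 2) / math.sqrt(n - 1)
    attenuation = math.sqrt(rel_x * rel_y)
    lower = (r_xy - z * se_r) / attenuation
    upper = (r_xy + z * se_r) / attenuation
    return lower, upper


if __name__ == "__main__":
    # Hypothetical summary statistics, chosen only for illustration.
    rho = disattenuated_correlation(0.30, 0.80, 0.70)
    low, high = hunter_schmidt_interval(0.30, 0.80, 0.70, n=200)
    print(f"corrected r = {rho:.3f}, interval = ({low:.3f}, {high:.3f})")
```

Because `rel_x` and `rel_y` enter only as fixed divisors, the interval width does not depend on how precisely the reliabilities were estimated, which is the source of the undercoverage the abstract describes.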
- Research Article
- 10.1177/01466216261425242
- Feb 13, 2026
- Applied psychological measurement
- Cody Ding
Survey questionnaires are essential tools in psychological and educational research, as the data they gather directly influence research conclusions and policy decisions. A major challenge in ensuring data quality is identifying aberrant response patterns, which can introduce errors into subsequent analyses and lead to flawed theoretical conclusions and misguided practical applications. This study presents a machine learning approach that uses autoencoder neural networks to detect aberrant response patterns in survey data. We evaluated the effectiveness of autoencoder neural networks in identifying response anomalies using both simulated and real data. The results indicate that this approach can effectively detect anomalies in responses, providing researchers with more options for their analyses and subsequent conclusions. Ultimately, this enhances the trustworthiness of findings in psychological and educational research.
- Research Article
- 10.1177/01466216261425440
- Feb 12, 2026
- Applied psychological measurement
- Kyung Yong Kim + 2 more
Item response theory (IRT) observed and true score equating are often conducted assuming that the latent variable is normally distributed. Although this might be a reasonable assumption for many educational and psychological assessments, not all variables can be approximated by a normal distribution. Under the common-item nonequivalent groups design, the current study examined the impact of latent density misspecification on IRT observed and true score equating. Specifically, equating results provided by two separate calibration estimates based on the Stocking-Lord linking method with normal and uniform weights and three concurrent calibration estimates obtained with different characterizations of the latent densities for the old and new groups were compared using both simulated and real data sets. In general, the concurrent calibration method with the latent densities for the two groups estimated using the empirical histogram method provided equating results with the least amount of error for most of the study conditions. Using normal weights with the Stocking-Lord method generally performed much better than using uniform weights; however, the overall performance of the Stocking-Lord method with normal weights was acceptable only if the latent densities for the two groups were normal distributions or close to normal distributions.
- Research Article
- 10.1177/01466216261422480
- Feb 9, 2026
- Applied psychological measurement
- Rudolf Debelak + 1 more
We present a fast, score-based test for detecting model misspecification in item response theory (IRT) models that remains valid when person parameters are treated as fixed effects, as may be done for very large data sets. The new approximation (i) eliminates the need to pre-specify ability groups or priors for person abilities, (ii) does not require explicit functional form assumptions, (iii) works with two estimators designed for very high item/person counts, constrained joint maximum likelihood (CJML) and joint maximum a posteriori (JMAP), and (iv) requires only a single model fit, making DIF screening faster and simpler than alternatives based on model comparisons. A spline-based residualization step further suppresses spurious Type I error when the ordering covariate is correlated with ability. Simulations with the two-parameter logistic model show nominal error rates and high power once examinees contribute around 15-20 responses; only extremely short tests (around 10 items) still pose challenges under strong impact. An application to 1,602 reading items and 57,684 students from the Mindsteps platform demonstrates scalability and practical value, flagging 13% of items for gender-related DIF and correlating highly with conventional approaches that model DIF explicitly. Together, these results position the proposed test as a robust, computation-light diagnostic for large-scale assessments when classical random-effects approaches are infeasible, ability group structure is unknown or complex, or the shape of DIF effects is unknown or complex.
- Research Article
- 10.1177/01466216261420758
- Feb 6, 2026
- Applied psychological measurement
- Jonas Bjermo + 2 more
Large-scale achievement tests require item banks with items for use in future tests. Before an item is included in the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of future achievement tests, it is important to perform this calibration well, and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suitable ability. Theoretical evidence shows advantages of using ability-dependent allocation of pretest items. However, it is not clear whether these theoretical results also hold in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify the kinds of items for which the method works well. This enables targeting specific item types for optimal calibration. We also discuss possible improvements to the method.
- Research Article
- 10.1177/01466216261415631
- Feb 3, 2026
- Applied psychological measurement
- Guangming Li
The Markov chain Monte Carlo (MCMC) method is increasingly used to estimate variance components in generalizability theory (GT). However, uninformative priors, an essential part of the MCMC method, have not been explored, and GT studies vary in their use of uninformative priors. This study focused on the effect of different uninformative priors on the estimation of variance components. Based on a p × i × r design, eight uninformative prior distributions were chosen for a simulation study and an empirical study: [prior 1], [prior 2], [prior 3], [prior 4], [prior 5], [prior 6], [prior 7], and [prior 8]. Three posterior point estimates (i.e., mean, median, and mode) were calculated with full data and with 10% missing/sparse data. The results show that: (1) [prior 1] performs best and most stably in posterior point estimation in most scenarios, while [prior 6] is always the worst; (2) the differences among methods are mainly reflected in two variance components, where [prior 6] shows extreme bias values, with maxima reaching 281.09 and 167.59; (3) posterior mean estimates always produce the largest biases, whereas posterior median estimates are the best; (4) the differences between uninformative priors in estimating variance components become greater when the number of levels of the variance components is small; (5) the results for full data and 10% missing/sparse data are about the same, so a small amount of missing/sparse data has minimal impact on the results. Running times for the eight priors range from 489.78 to 692.58 seconds and do not differ greatly.
- Research Article
- 10.1177/01466216261420305
- Jan 28, 2026
- Applied psychological measurement
- Xiaozhu Jian + 3 more
This study presents a novel extension of the weighted score logistic model (WSLM). The WSLM is an advancement of the traditional dichotomous logistic model that incorporates an additional weighted score parameter. This model is specifically designed to analyze non-continuous category scored polytomous items in educational and psychological testing contexts. Within the WSLM framework, the mean difficulty parameter reflects the overall item difficulty, while both discrimination and mean difficulty parameters are estimated using marginal maximum likelihood estimation. A Monte Carlo simulation study was conducted to evaluate the performance of the WSLM, which demonstrated low levels of bias and root mean square error (RMSE) of item parameters, indicative of accurate parameter recovery. Under most simulation conditions, the fit statistics Q1 and Q4 for polytomous items under the WSLM remained below their respective critical chi-square values, suggesting acceptable model-data fit. These results support the applicability and robustness of the WSLM in practical assessment settings involving complex scoring schemes.
- Research Article
- 10.1177/01466216261416025
- Jan 20, 2026
- Applied psychological measurement
- Jari Metsämuuronen
Cohen's d is the most commonly used estimator to quantify the magnitude of the difference between the means of two subpopulations. When comparing multiple populations simultaneously, Cohen's f can be used for the same purpose. Using their relationship in the dichotomous setting, several general formulas for d are derived that generalize d to the polytomous setting. The traditional simplified estimator d = 2f is studied as a shortcut estimator. It is strongly recommended to use the general formulas instead of the simplified ones when assessing the magnitude of the effect size, especially when the discrepancy of the extreme proportions of cases in the subpopulations exceeds 0.40.
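The shortcut d = 2f can be checked against the textbook two-group case with a short Python sketch. The helper names are hypothetical, and the identity used, d = f / sqrt(p(1 - p)) for two groups with proportions p and 1 - p and a common within-group standard deviation, is the standard two-group relation rather than the article's general polytomous formulas, which the abstract does not state.

```python
import math


def cohens_d(mean1, mean2, pooled_sd):
    """Standardized difference between two group means."""
    return (mean1 - mean2) / pooled_sd


def cohens_f(means, proportions, pooled_sd):
    """Cohen's f: SD of the group means (weighted by proportions) over the within-group SD."""
    grand_mean = sum(p * m for p, m in zip(proportions, means))
    var_means = sum(p * (m - grand_mean) ** 2 for p, m in zip(proportions, means))
    return math.sqrt(var_means) / pooled_sd


def d_from_f_two_groups(f, p):
    """Recover |d| from f for two groups with proportions p and 1 - p."""
    return f / math.sqrt(p * (1.0 - p))


if __name__ == "__main__":
    # Hypothetical means and SD, chosen only for illustration.
    d = cohens_d(10.0, 8.0, 4.0)
    for p in (0.5, 0.9):
        f = cohens_f([10.0, 8.0], [p, 1.0 - p], 4.0)
        print(f"p={p}: d={d:.3f}, 2f={2 * f:.3f}, f/sqrt(p(1-p))={d_from_f_two_groups(f, p):.3f}")
```

With equal proportions the shortcut reproduces d exactly; with a 0.9/0.1 split it understates d, which is consistent with the abstract's warning about large discrepancies between the extreme proportions.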
- Research Article
- 10.1177/01466216251415011
- Jan 7, 2026
- Applied psychological measurement
- Yale Quan + 1 more
Educational constructs are becoming increasingly complex and are often conceptualized at both a general level and a subdomain level. It is often desirable to report scores from both levels simultaneously. However, measuring such complex constructs requires a very large item bank that is hard for a student to complete in any reasonable timeframe. Furthermore, most current score reporting practices either report only subdomain scores or calculate the general domain score post hoc. We propose that a multiple group HO-IRT model with structural missingness can be used to simultaneously report general and subdomain scores while controlling assessment length. Although the model itself is not new, we consider a novel application scenario using a NEAT design with both a representative and a non-representative anchor test. While a representative anchor test is recommended in the literature, it is sometimes unrealistic in practice when the multidimensional construct shifts over time. Hence, exploring the parameter recovery of multiple group HO-IRT in the presence of a non-representative anchor test is especially interesting and important. We show, through Monte Carlo simulation, that the RMSE of IRT estimates retrieved under a non-representative anchor item set with a moderate correlation between the higher- and lower-order factors is comparable to the RMSE of IRT estimates retrieved under a representative anchor item set. Missing data were addressed using a full-information maximum likelihood approach to parameter estimation.
- Research Article
- 10.1177/01466216251415189
- Jan 3, 2026
- Applied psychological measurement
- Sean Joo + 2 more
The field of psychometrics has made remarkable progress in developing item response theory (IRT) models for analyzing multidimensional forced choice (MFC) measures. This study introduces an innovative method that enhances the latent trait estimation of the Multi-Unidimensional Pairwise Preference (MUPP) model by incorporating latent regression modeling. To validate the efficacy of the new method, we conducted a comprehensive simulation study. The results of the study provide compelling evidence that the proposed latent regression MUPP (LR-MUPP) model significantly improves the accuracy of the latent trait estimation. This study opens new avenues for future research and encourages further development and refinement of MFC IRT models and their applications.