- Research Article
- 10.1177/00131644251395323
- Dec 19, 2025
- Educational and psychological measurement
- Michalis P Michaelides + 1 more
Respondent behavior in questionnaires may vary in terms of attention, effort, and consistency depending on the survey administration context and motivational conditions. This pre-registered experimental study examined whether motivational context influences response inconsistency, response times, and the role of conscientiousness in survey responding. A sample of 66 university students in Cyprus completed five psychological scales under both low-stakes and high-stakes instructions in a counterbalanced within-subjects design. To identify inconsistent respondents, two index-based methods were used: the mean absolute difference (MAD) index and Mahalanobis distance. Results showed that inconsistent responding was somewhat more frequent under low-stakes conditions, although differences were generally small and significant only for selected scales when using a lenient MAD threshold. By contrast, internal consistency reliability was slightly higher, and response times were significantly longer under high-stakes instructions, indicating greater deliberation. Conscientiousness predicted lower inconsistency only in the low-stakes condition. Overall, high-stakes instructions did not substantially reduce inconsistent responding but fostered longer response times and modest gains in reliability, suggesting enhanced behavioral engagement. Implications for survey design and data quality in psychological and educational research are discussed.
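As an illustration of the second index named above, the following is a minimal sketch of squared Mahalanobis distances computed per respondent. The response matrix, the function names, and the use of the sample covariance are assumptions made for the example, not the study's data or code; a flagging cut-off would be a separate choice.

```python
from statistics import mean


def covariance_matrix(rows):
    """Return the item means and the sample covariance matrix (n - 1 denominator)."""
    n, p = len(rows), len(rows[0])
    means = [mean(r[j] for r in rows) for j in range(p)]
    cov = [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
         for k in range(p)]
        for j in range(p)
    ]
    return means, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each response vector from the item means."""
    means, cov = covariance_matrix(rows)
    distances = []
    for r in rows:
        d = [r[j] - means[j] for j in range(len(means))]
        z = solve(cov, d)  # z = cov^{-1} d, without forming the inverse
        distances.append(sum(dj * zj for dj, zj in zip(d, z)))
    return distances


# Hypothetical responses: 8 respondents, 3 items on a 1-5 scale.
responses = [
    [1, 1, 2], [2, 2, 2], [3, 3, 4], [4, 4, 3],
    [5, 5, 5], [2, 3, 2], [4, 3, 4], [5, 1, 3],
]
distances = mahalanobis_sq(responses)
for row, d2 in zip(responses, distances):
    print(row, round(d2, 2))
```

Larger values mark response patterns that are unusual given the inter-item covariances.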
- Research Article
- 10.1177/00131644251393483
- Dec 12, 2025
- Educational and psychological measurement
- Bruno D Zumbo
This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency: linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.
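The shared structure mentioned above can be illustrated with a small sketch of the Bregman divergence, D(x, y) = φ(x) − φ(y) − φ′(y)(x − y), for two convex potentials. It checks numerically the standard result that the arithmetic mean minimizes the total divergence for either potential. The data, the grid, and the names are assumptions for illustration and are not taken from the article.

```python
import math


def bregman(phi, dphi, x, y):
    """Bregman divergence D_phi(x, y) for a scalar convex potential phi."""
    return phi(x) - phi(y) - dphi(y) * (x - y)


# Two potentials, each given as (phi, derivative of phi).
square = (lambda t: t * t, lambda t: 2.0 * t)                       # squared error
neg_entropy = (lambda t: t * math.log(t) - t, lambda t: math.log(t))  # generalized KL


def total_divergence(potential, data, c):
    """Sum of D_phi(x_i, c) over the data for a candidate center c."""
    phi, dphi = potential
    return sum(bregman(phi, dphi, x, c) for x in data)


data = [1.0, 2.0, 6.0]            # hypothetical positive scores
data_mean = sum(data) / len(data)

# Search a grid of candidate centers; both potentials pick the mean.
grid = [k / 10 for k in range(5, 85)]
best_square = min(grid, key=lambda c: total_divergence(square, data, c))
best_entropy = min(grid, key=lambda c: total_divergence(neg_entropy, data, c))

print(data_mean, best_square, best_entropy)
```

With the squared potential the divergence reduces to (x − y)², which is the least-squares case.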
- Research Article
- 10.1177/00131644251389891
- Nov 12, 2025
- Educational and psychological measurement
- Bruno D Zumbo
This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as a conditional expectation, representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability quantifies the efficiency of this projection: the squared cosine between the observed score and its true-score projection. This formulation unifies reliability with regression, factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.
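A finite-dimensional toy analogue may help make the projection reading concrete. In the sketch below, centered score vectors stand in for random variables, the error vector is constructed to be orthogonal to the true-score vector, and reliability is computed both as a variance ratio and as a squared cosine. The vectors and names are assumptions chosen for the example; the article itself works with conditional expectations rather than finite vectors.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical centered scores for four examinees.
true_score = [1.0, -1.0, 1.0, -1.0]
error = [0.5, 0.5, -0.5, -0.5]          # orthogonal to true_score by construction
observed = [t + e for t, e in zip(true_score, error)]

# Orthogonal projection of the observed scores onto the true-score direction.
slope = dot(observed, true_score) / dot(true_score, true_score)
projection = [slope * t for t in true_score]
residual = [x - p for x, p in zip(observed, projection)]

# Reliability as a variance ratio and as a squared cosine.
variance_ratio = dot(true_score, true_score) / dot(observed, observed)
cos_squared = dot(observed, true_score) ** 2 / (
    dot(observed, observed) * dot(true_score, true_score)
)

print(variance_ratio, cos_squared, dot(residual, true_score))
```

The two reliability values coincide, and the residual is orthogonal to the true-score vector, mirroring the description of measurement error as the orthogonal complement of the projection.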
- Research Article
- 10.1177/00131644251376553
- Nov 8, 2025
- Educational and psychological measurement
- Rashid Saif Almehrizi
Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
- Research Article
- 10.1177/00131644251379802
- Nov 7, 2025
- Educational and psychological measurement
- Haeju Lee + 3 more
Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of DIF detection methods have been introduced, both based on item response theory (IRT) and not, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed least absolute shrinkage and selection operator (LASSO) regularization method has shown promising results in detecting DIF on multiple background variables. To provide more insight, in this study, we compared three DIF detection methods, including the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and power rates of the three methods for identifying DIF items on one of the variables depended not only on the sample size and its DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between them. We presented other findings and discussed the limitations and future research directions in this paper.
- Research Article
- 10.1177/00131644251380540
- Nov 3, 2025
- Educational and psychological measurement
- Rashid Saif Almehrizi
Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
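The prevalence sensitivity described above can be seen in a short sketch of the standard Fleiss's Kappa computation. This is the classical coefficient, not the new coefficient the paper proposes, and the rating counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Standard Fleiss's Kappa.

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every subject must be rated by the same number of raters (at least two).
    Returns (observed agreement, chance-expected agreement, kappa).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number of ratings")
    n_categories = len(counts[0])

    # Proportion of agreeing rater pairs for each subject, then averaged.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall category proportions.
    prevalence = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in prevalence)

    return observed, expected, (observed - expected) / (1 - expected)


# Hypothetical data: 10 subjects, 3 raters, 2 categories, one dominant category.
skewed = [[3, 0]] * 9 + [[2, 1]]
observed, expected, kappa = fleiss_kappa(skewed)
print(round(observed, 3), round(expected, 3), round(kappa, 3))
```

In this constructed case the raters agree on about 93% of rating pairs, yet the coefficient is slightly negative because nearly all ratings fall in one category.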
- Research Article
- 10.1177/00131644251377381
- Nov 3, 2025
- Educational and psychological measurement
- Daniel A Sass + 1 more
Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures, both in terms of the number of factors and the variables retained, while identifying the "best" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (n = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.
- Research Article
- 10.1177/00131644251380585
- Oct 29, 2025
- Educational and psychological measurement
- Jiayi Liu + 4 more
The Common Persons (CP) equating design offers critical advantages for high-security testing contexts (eliminating anchor item exposure risks while accommodating non-equivalent groups), yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (N = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating 8 factors: test length, difficulty shift, ability dispersion, correlation between test forms and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) test factors dominate outcomes: difficulty shifts ( = 1) degrade IRT precision severely (|%Bias| >22% vs. linear/equipercentile's |%Bias| <1.5%), while longer tests reduce NRMSE and wider ability dispersion ( = 1) enhances precision through improved person-item targeting; (c) equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-vs-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions like synthetic respondents.
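Of the four methods compared above, identity and linear equating are simple enough to sketch in a few lines. The example assumes a small group of common persons who took both forms and uses the textbook mean-sigma form of linear equating; the scores and function names are hypothetical and this is not the study's simulation code.

```python
from statistics import mean, stdev


def linear_equating(x_scores, y_scores):
    """Build a function mapping Form X scores onto the Form Y scale.

    Matches the mean and standard deviation of the two score distributions:
    y = mean_y + (sd_y / sd_x) * (x - mean_x).
    """
    mu_x, mu_y = mean(x_scores), mean(y_scores)
    slope = stdev(y_scores) / stdev(x_scores)

    def equate(x):
        return mu_y + slope * (x - mu_x)

    return equate


def identity_equating(x):
    """Identity equating treats the two forms as already on the same scale."""
    return x


# Hypothetical scores of five common persons on both forms.
form_x = [10, 12, 14, 16, 18]
form_y = [13, 16, 19, 22, 25]

equate = linear_equating(form_x, form_y)
for raw in (10, 14, 20):
    print(raw, identity_equating(raw), round(equate(raw), 2))
```

Comparing the two columns shows how far identity equating can drift from the linear result when the forms differ in difficulty or spread.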
- Research Article
- 10.1177/00131644251379773
- Oct 27, 2025
- Educational and psychological measurement
- Xinya Liang + 3 more
In applied research across education, the social and behavioral sciences, and medicine, path models frequently incorporate both continuous and ordinal manifest variables to predict binary outcomes. This study employs Monte Carlo simulations to evaluate six estimators: robust maximum likelihood with probit and logit links (MLR-probit, MLR-logit), mean- and variance-adjusted weighted and unweighted least squares (WLSMV, ULSMV), and Bayesian methods with noninformative and weakly informative priors (Bayes-NI, Bayes-WI). Across various sample sizes, variable scales, and effect sizes, results show that WLSMV and Bayes-WI consistently achieve low bias and RMSE, particularly in small samples or when mediators have few categories. By contrast, categorical MLR approaches tended to yield unstable estimates for modest effects. These findings offer practical guidance for selecting estimators in mixed-scale path analyses and underscore their implications for robust inference.
- Research Article
- 10.1177/00131644251380777
- Oct 23, 2025
- Educational and psychological measurement
- Larry V Hedges
Researchers conducting systematic reviews and meta-analyses often encounter studies in which the research design is a well-conducted cluster randomized trial, but the statistical analysis does not take clustering into account. For example, the study might assign treatments by clusters but the analysis may not take into account the clustered treatment assignment. Alternatively, the analysis of the primary outcome of the study might take clustering into account, but the reviewer might be interested in another outcome for which only summary data are available in a form that does not take clustering into account. This article provides expressions for the approximate variance of risk differences, log risk ratios, and log odds ratios computed from clustered binary data, using the intraclass correlations. An example illustrates the calculations. References to empirical estimates of intraclass correlations are provided.
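A rough sketch of this kind of adjustment is shown below. It uses the familiar design-effect approximation, 1 + (m − 1)ρ, with equal cluster sizes m and a common intraclass correlation ρ in both arms. That approximation is an assumption made for the example and may differ from the expressions derived in the article; the counts and function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation factor for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def naive_variances(a, b, c, d):
    """Variances that ignore clustering, from a 2x2 table.

    a, b: events and non-events in the treatment arm.
    c, d: events and non-events in the control arm.
    """
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    return {
        "risk_difference": p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2,
        "log_risk_ratio": 1 / a - 1 / n1 + 1 / c - 1 / n2,
        "log_odds_ratio": 1 / a + 1 / b + 1 / c + 1 / d,
    }


def cluster_adjusted_variances(a, b, c, d, cluster_size, icc):
    """Inflate each naive variance by the design effect (assumed equal in both arms)."""
    deff = design_effect(cluster_size, icc)
    return {name: v * deff for name, v in naive_variances(a, b, c, d).items()}


# Hypothetical trial: 100 participants per arm, clusters of 21, ICC of 0.05.
naive = naive_variances(30, 70, 20, 80)
adjusted = cluster_adjusted_variances(30, 70, 20, 80, cluster_size=21, icc=0.05)
for name in naive:
    print(name, round(naive[name], 4), round(adjusted[name], 4))
```

With these inputs the design effect is about 2, so each variance roughly doubles relative to the analysis that ignores clustering.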