- Research Article
- 10.1177/00131644251405406
- Jan 2, 2026
- Educational and Psychological Measurement
- Tianpeng Zheng + 3 more
A critical methodological challenge in standard setting arises in small-sample, high-dimensional contexts where the number of items substantially exceeds the number of examinees. Under such conditions, conventional data-driven methods that rely on parametric models (e.g., item response theory) often become unstable or fail due to unreliable parameter estimation. This study investigates two families of data-driven methods, information-theoretic and unsupervised clustering, that offer a potential solution to this challenge. Using a Monte Carlo simulation, we systematically evaluate 15 such methods to establish an evidence-based framework for practice. The simulation manipulated five factors: sample size, the item-to-examinee ratio, mixture proportions, item quality, and ability separation. Method performance was evaluated using multiple criteria, including relative error, classification accuracy, sensitivity, specificity, and Youden's index. Results indicated that no single method is universally superior; the optimal choice depends on the examinee mixture proportion. Specifically, the information-theoretic quantile information ratio (QIR) method excelled in scenarios with a dominant non-competent group, where high specificity was critical. Conversely, in highly selective contexts with balanced proficiency groups, the clustering methods based on the Calinski-Harabasz index (CHI) and the sum of squared error (SSE) demonstrated the highest classification effectiveness. Bayesian kernel density estimation (BKDE), meanwhile, consistently performed as a robust, balanced method across conditions. These findings provide practitioners with a clear decision framework for selecting a defensible, data-driven standard-setting method when traditional approaches are infeasible.
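To make the clustering side of these methods concrete, here is a minimal base-R sketch (not the authors' implementation; the simulated score mixture, the fixed two-cluster split, and the midpoint cut rule are illustrative assumptions) that derives a candidate cut score from a k-means solution and computes the SSE and Calinski-Harabasz criteria named above:

```r
## Minimal sketch: cut score from a two-cluster solution on total scores.
set.seed(1)
scores <- c(rnorm(60, mean = 20, sd = 4),   # hypothetical non-competent group
            rnorm(40, mean = 32, sd = 4))   # hypothetical competent group

km <- kmeans(scores, centers = 2, nstart = 25)

## Sum of squared error (SSE): total within-cluster sum of squares
sse <- km$tot.withinss

## Calinski-Harabasz index (CHI): between- vs. within-cluster dispersion
n <- length(scores); k <- 2
chi <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))

## One plausible cut score: the midpoint between the two cluster centers
cut_score <- mean(km$centers)
c(SSE = sse, CHI = chi, cut = cut_score)
```

In practice, CHI or SSE would be compared across candidate solutions rather than computed for a single fixed k, as in the 15-method comparison the study reports.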
- Research Article
- 10.1177/00131644251393444
- Dec 24, 2025
- Educational and Psychological Measurement
- Minho Lee + 1 more
Residual-based fit statistics, which compare observed item statistics (e.g., proportions) with model-implied probabilities, are widely used to evaluate model fit, item fit, and local dependence in item response theory (IRT) models. Despite the prevalence of item nonresponse in empirical studies, its impact on these statistics has not been systematically examined. Existing software packages often apply heuristic treatments (e.g., listwise or pairwise deletion), which can distort fit statistics because missing data further inflate discrepancies between observed and expected proportions. This study evaluates the appropriateness of such treatments through extensive simulation. Results show that deletion methods degrade the accuracy of fit testing: fit indices are inflated under both null and power conditions, with the bias worsening as missingness increases. In addition, the impact of missing data exceeds that of model misspecification. Practical recommendations and alternative methods are discussed to guide applied researchers.
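The mechanism described here is easy to demonstrate in miniature. The sketch below is a generic illustration, not the study's design: the one-item logistic model, the ability-dependent missingness rule, and the sample size are all assumptions. It shows how deleting nonresponses shifts an observed proportion away from its model-implied value, which is precisely the discrepancy that residual-based fit statistics accumulate:

```r
## How deletion can inflate observed-vs-expected discrepancies for one item.
set.seed(1)
n <- 2000
theta  <- rnorm(n)
p_true <- plogis(theta - 0.5)        # model-implied success probability
y      <- rbinom(n, 1, p_true)

## Ability-dependent (MAR) missingness: low-theta examinees skip more often
miss  <- rbinom(n, 1, plogis(-theta)) == 1
y_obs <- ifelse(miss, NA, y)

## For a single item, pairwise/listwise deletion both reduce to dropping NAs
c(model_implied  = mean(p_true),
  complete_data  = mean(y),
  after_deletion = mean(y_obs, na.rm = TRUE))  # drifts above the model value
```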
- Research Article
- 10.1177/00131644251396543
- Dec 20, 2025
- Educational and Psychological Measurement
- Dimiter M Dimitrov + 1 more
Building on previous research on conditional reliability for number-correct test scores, conditioned on levels of the logit scale in item response theory, this article addresses the conditional reliability of classical-type weighted scores conditioned on latent levels of a bounded scale. This is done in the framework of the D-scoring method of measurement (the D-scale, bounded from 0 to 1). Along with the conditional reliability of weighted D-scores, conditioned on latent levels of the D-scale, the article presents additional measures of precision: the conditional standard error, the conditional signal-to-noise ratio, and the marginal reliability. R syntax for all computations is also provided.
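The abstract does not reproduce the formulas, but as a reader's sketch the listed precision measures usually take the following standard forms at a latent level $\theta$ (the notation below is assumed for illustration, not taken from the article):

```latex
% Standard forms of conditional precision measures at latent level \theta:
% conditional standard error, conditional signal-to-noise ratio, conditional
% reliability, and marginal reliability of the weighted D-scores.
\begin{align}
  \operatorname{SE}(D \mid \theta)
      &= \sqrt{\operatorname{Var}(D \mid \theta)}, \\
  \mathrm{S/N}(\theta)
      &= \frac{\sigma^2_T(\theta)}{\operatorname{SE}^2(D \mid \theta)}, \\
  \rho(\theta)
      &= \frac{\mathrm{S/N}(\theta)}{1 + \mathrm{S/N}(\theta)}, \\
  \bar{\rho}
      &= \frac{\sigma^2_T}{\sigma^2_T + \mathbb{E}_{\theta}\!\left[\operatorname{SE}^2(D \mid \theta)\right]}.
\end{align}
```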
- Research Article
- 10.1177/00131644251397428
- Dec 19, 2025
- Educational and Psychological Measurement
- Ji Yoon Jung + 2 more
Conventional cross-country scoring reliability in international large-scale assessments often depends on double scoring, which typically involves relatively small samples of multilingual responses. To extend the reach of reliability estimation, this study introduces the Linguistic-integrated Reliability Audit (LiRA), a novel method that measures scoring reliability using an entire dataset in a large-scale, multilingual context. LiRA automatically generates a second score for each response by analyzing its semantic alignment within a neighborhood of similar responses, then applies weighted majority voting to determine a consensus score. Results demonstrate that LiRA provides a more comprehensive and systematic estimation of scoring reliability at the item, country, and language levels, while preserving the fundamental concepts of traditional reliability.
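As a rough sketch of the general recipe only (not the authors' LiRA code: the random embeddings, the neighborhood size k, and the similarity weighting below are placeholders for real multilingual sentence embeddings and the paper's settings):

```r
## Neighborhood-based "second score" via similarity-weighted majority voting.
set.seed(1)
n <- 200; d <- 16; k <- 10
emb    <- matrix(rnorm(n * d), n, d)      # placeholder response embeddings
scores <- sample(0:2, n, replace = TRUE)  # human scores on a 0-2 rubric

emb <- emb / sqrt(rowSums(emb^2))         # unit-normalize rows
sim <- emb %*% t(emb)                     # cosine similarity matrix

second_score <- sapply(seq_len(n), function(i) {
  nb <- order(sim[i, -i], decreasing = TRUE)[1:k]  # k nearest neighbors
  nb <- seq_len(n)[-i][nb]                         # map back to row indices
  w  <- pmax(sim[i, nb], 0)                        # similarity weights
  votes <- tapply(w, factor(scores[nb], levels = 0:2), sum)
  as.integer(names(which.max(replace(votes, is.na(votes), 0))))
})

## Agreement between the human score and the machine "second rater"
mean(second_score == scores)
```

With placeholder embeddings the agreement is near chance; the method's value comes entirely from embeddings in which semantically similar responses sit close together.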
- Research Article
- 10.1177/00131644251401097
- Dec 19, 2025
- Educational and Psychological Measurement
- Jin Liu + 3 more
Applied researchers often encounter situations where certain item response categories receive very few endorsements, resulting in sparse data. Collapsing categories may mitigate sparsity by increasing cell counts, yet the methodological consequences of this practice remain insufficiently explored. The current study examined the effects of category collapsing in Likert-type scale data through a simulation study under the confirmatory factor analysis model. Sparse response categories were collapsed to determine the impact on fit indices (i.e., chi-square, comparative fit index [CFI], Tucker-Lewis index [TLI], root mean square error of approximation [RMSEA], and standardized root mean square residual [SRMR]). Findings indicate that category collapsing has a substantial impact when sparsity is severe, leading to fewer model rejections in both correctly specified and misspecified models. In addition, different fit indices exhibited varying sensitivities to collapsing. Specifically, RMSEA was recommended for correctly specified models, and TLI with a cut-off value of .95 was recommended for misspecified models. Results from an empirical analysis were consistent with the simulation findings. These results provide valuable insights for researchers confronted with sparse data in applied measurement contexts.
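A minimal sketch of what such collapsing looks like in practice (the 5% sparsity rule and the merge-into-neighbor convention below are illustrative assumptions, not the study's exact procedure):

```r
## Merge sparse outer categories of a Likert item into their neighbors.
collapse_sparse <- function(x, min_prop = 0.05) {
  repeat {
    lev <- sort(unique(x))
    p   <- prop.table(table(factor(x, levels = lev)))
    if (length(lev) > 2 && p[1] < min_prop) {
      x[x == lev[1]] <- lev[2]                          # merge lowest upward
    } else if (length(lev) > 2 && p[length(p)] < min_prop) {
      x[x == lev[length(lev)]] <- lev[length(lev) - 1]  # merge highest downward
    } else break
  }
  x
}

set.seed(1)
item <- sample(1:5, 300, replace = TRUE, prob = c(.02, .08, .40, .40, .10))
table(item)                   # category 1 is sparse
table(collapse_sparse(item))  # category 1 merged into category 2
```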
- Research Article
- 10.1177/00131644251393203
- Dec 19, 2025
- Educational and Psychological Measurement
- Francis Huang
Although cluster-robust standard errors (CRSEs) are commonly used to account for violations of the independence of observations in nested data, an underappreciated issue is that there are several situations in which CRSEs can fail to maintain the nominal Type I error rate. These situations (e.g., analyzing data with imbalanced cluster sizes) are readily found in various types of education-related datasets and are important to consider when conducting inferential tests with cluster-level predictors. Using a Monte Carlo simulation, we investigated these conditions and tested alternative estimators and degrees-of-freedom (df) adjustments to assess how well they ameliorate the problems associated with the traditional CRSE (CR1) estimator, using both continuous and dichotomous predictors. Findings showed that the bias-reduced linearization estimator (CR2) and the jackknife estimator (CR3), together with df adjustments, were generally effective at maintaining Type I error rates for most of the conditions tested. Results also indicated that CR1 paired with df based on the effective cluster size was acceptable. We emphasize the importance of clearly describing the nested data structure, as the characteristics of the dataset can influence Type I error rates when using CRSEs.
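For reference, the traditional CR1 estimator evaluated here has a standard textbook form that fits in a few lines of base R (a generic sketch, not the paper's code; CR2 and CR3 require additional cluster-specific adjustment matrices not shown):

```r
## CR1 cluster-robust standard errors for an OLS fit (classic sandwich form).
cr1_se <- function(fit, cluster) {
  X <- model.matrix(fit); e <- residuals(fit)
  n <- nrow(X); k <- ncol(X); G <- length(unique(cluster))
  bread <- solve(crossprod(X))              # (X'X)^{-1}
  u     <- rowsum(X * e, group = cluster)   # G x k cluster score sums X_g' e_g
  meat  <- crossprod(u)                     # sum over clusters of u_g u_g'
  c_adj <- (G / (G - 1)) * ((n - 1) / (n - k))  # CR1 small-sample factor
  sqrt(diag(c_adj * bread %*% meat %*% bread))
}

## Toy usage with badly imbalanced cluster sizes and a cluster-level predictor
set.seed(1)
g <- rep(1:20, times = c(rep(2, 15), rep(40, 5)))  # 15 tiny + 5 huge clusters
x <- rnorm(20)[g]                                  # cluster-level predictor
y <- rnorm(length(g)) + rnorm(20)[g]               # outcome with cluster effect
cr1_se(lm(y ~ x), g)
```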
- Research Article
- 10.1177/00131644251395323
- Dec 19, 2025
- Educational and Psychological Measurement
- Michalis P Michaelides + 1 more
Respondent behavior in questionnaires may vary in terms of attention, effort, and consistency depending on the survey administration context and motivational conditions. This pre-registered experimental study examined whether motivational context influences response inconsistency, response times, and the role of conscientiousness in survey responding. A sample of 66 university students in Cyprus completed five psychological scales under both low-stakes and high-stakes instructions in a counterbalanced within-subjects design. To identify inconsistent respondents, two index-based methods were used: the mean absolute difference (MAD) index and Mahalanobis distance. Results showed that inconsistent responding was somewhat more frequent under low-stakes conditions, although differences were generally small and significant only for selected scales when using a lenient MAD threshold. By contrast, internal consistency reliability was slightly higher, and response times were significantly longer under high-stakes instructions, indicating greater deliberation. Conscientiousness predicted lower inconsistency only in the low-stakes condition. Overall, high-stakes instructions did not substantially reduce inconsistent responding but fostered longer response times and modest gains in reliability, suggesting enhanced behavioral engagement. Implications for survey design and data quality in psychological and educational research are discussed.
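Both flagging indices have simple generic forms; the sketch below is illustrative only (the adjacent-item pairing for the MAD-type index and the chi-square cutoff are assumptions, not the study's settings):

```r
## Two person-level inconsistency indices on a 100 x 10 Likert matrix.
set.seed(1)
X <- matrix(sample(1:5, 100 * 10, replace = TRUE), nrow = 100, ncol = 10)

## Mahalanobis distance of each response vector from the sample centroid
md      <- mahalanobis(X, colMeans(X), cov(X))
flag_md <- md > qchisq(.99, df = ncol(X))  # one common cutoff choice

## MAD-type index: mean absolute difference over item pairs assumed to
## measure the same construct (here, arbitrary adjacent pairs)
pairs     <- cbind(seq(1, 9, by = 2), seq(2, 10, by = 2))
mad_index <- rowMeans(abs(X[, pairs[, 1]] - X[, pairs[, 2]]))

head(cbind(md, flag_md, mad_index))
```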
- Research Article
- 10.1177/00131644251393483
- Dec 12, 2025
- Educational and Psychological Measurement
- Bruno D Zumbo
This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency: linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.
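The central object of the framework is standard and worth stating: for a differentiable, strictly convex potential $\varphi$, the Bregman divergence is defined below, and the quadratic potential recovers the least-squares geometry of CTT (standard definitions only; the article's specific constructions are not reproduced here):

```latex
% Bregman divergence for a strictly convex, differentiable potential \varphi,
% and the quadratic special case that recovers orthogonal projection.
\begin{align}
  D_{\varphi}(x, y) &= \varphi(x) - \varphi(y)
      - \langle \nabla \varphi(y),\, x - y \rangle, \\
  \varphi(x) = \tfrac{1}{2}\lVert x \rVert^{2}
      \;&\Longrightarrow\;
      D_{\varphi}(x, y) = \tfrac{1}{2}\lVert x - y \rVert^{2}.
\end{align}
```

Under the quadratic potential, Bregman projection is the orthogonal projection of classical test theory; other potentials (e.g., negative entropy) yield the information projections of exponential-family models.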
- Research Article
- 10.1177/00131644251389891
- Nov 12, 2025
- Educational and Psychological Measurement
- Bruno D Zumbo
This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation $T = \mathbb{E}[X \mid \sigma(\Theta)]$, representing the orthogonal projection of the observed score $X$ onto the σ-algebra of the latent variable $\Theta$. Reliability, expressed as $\rho = \operatorname{Var}(T)/\operatorname{Var}(X)$, quantifies the efficiency of this projection: the squared cosine between $X$ and its true-score projection $T$. This formulation unifies reliability with regression $R^2$, factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.
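The projection identity at the heart of this formulation can be checked numerically. The following generic check (not the article's own examples) confirms that with $X = T + E$ and error orthogonal to the true score, $\operatorname{Var}(T)/\operatorname{Var}(X)$ coincides with the squared correlation, i.e., the squared cosine:

```r
## Numeric check: reliability as projection efficiency (squared cosine).
set.seed(1)
T_score <- rnorm(1e5, sd = 2)   # true score (the projection of X)
E_err   <- rnorm(1e5, sd = 1)   # orthogonal error component
X_obs   <- T_score + E_err

c(var_ratio   = var(T_score) / var(X_obs),
  squared_cos = cor(X_obs, T_score)^2)   # both approach 4/5 = 0.8
```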
- Research Article
- 10.1177/00131644251376553
- Nov 8, 2025
- Educational and Psychological Measurement
- Rashid Saif Almehrizi
Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
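The abstract does not state the weighted-Lambda formula itself, so the sketch below computes only the baseline quantities it is contrasted with: the weighted observed agreement and Cohen's weighted Kappa under linear agreement weights (the rating table is invented for illustration):

```r
## Weighted observed agreement and weighted Kappa on an ordinal rating table.
ratings <- matrix(c(20,  5,  1,
                     4, 25,  6,
                     1,  3, 15), nrow = 3, byrow = TRUE)
p <- ratings / sum(ratings)                    # joint proportions
k <- nrow(p)
w <- 1 - abs(outer(1:k, 1:k, "-")) / (k - 1)   # linear agreement weights

po <- sum(w * p)                               # weighted observed agreement
pe <- sum(w * outer(rowSums(p), colSums(p)))   # chance expectation (independence)
c(P_o = po, P_e = pe, weighted_kappa = (po - pe) / (1 - pe))
```

Weighted Lambda, by contrast, starts from the observed weighted agreement but, per the abstract, adjusts for prevalence-agreement effects rather than for chance agreement.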