- Research Article
- 10.1177/00131644261418138
- Feb 16, 2026
- Educational and psychological measurement
- Mingfeng Xue + 3 more
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we found that (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.
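The majority-voting idea in finding (e) can be illustrated with a minimal sketch. The function name `majority_vote`, the example scores, and the tie-breaking rule (the lowest tied score wins) are assumptions made for illustration; the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(scores):
    """Return the most frequent score among several LLM ratings of one response.

    Ties are broken by taking the lowest tied score (an illustrative
    assumption, not a rule stated in the abstract).
    """
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(scores)
    top = max(counts.values())
    return min(score for score, n in counts.items() if n == top)


# Hypothetical scores from five LLMs for a single response.
llm_scores = {"Claude": 2, "DeepSeek": 2, "Gemini": 3, "GPT": 1, "Qwen": 2}
consensus = majority_vote(list(llm_scores.values()))
```

With an odd number of raters and a binary rubric a tie cannot occur, but with polytomous scores it can, so any real application would need an explicit tie rule.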
- Research Article
- 10.1177/00131644261417643
- Feb 16, 2026
- Educational and psychological measurement
- Irene Gianeselli
Inter-rater reliability is commonly assessed using chance-corrected agreement coefficients such as Cohen's κ, which summarize concordance among categorical judgments without modeling the inferential processes that generate them. As a result, κ is sensitive to prevalence imbalance, task difficulty, and heterogeneity in decision criteria and is often misinterpreted as a proxy for diagnostic accuracy or rater competence. This paper reframes inter-rater reliability within a signal detection-theoretic (SDT) framework in which categorical judgments arise from comparisons between latent continuous evidence and rater-specific decision thresholds. Within this generative model, κ can be interpreted as a bounded transformation of discrete strategic variance (i.e., the observable consequence of dispersion in latent decision criteria) rather than as a direct measure of epistemic alignment. To make this structure explicit, we introduce the Strategic Convergence Index (SCI), a normalized functional summarizing convergence in rater decision thresholds under an SDT generative process. SCI is not proposed as a standalone agreement coefficient but as a model-implied quantity whose interpretation depends on explicit assumptions about evidence distributions and decision rules. Monte Carlo simulations show that κ varies systematically with prevalence and perceptual discriminability even when decision-policy alignment is held constant, whereas SCI selectively tracks epistemic alignment and remains invariant to these factors. Supplementary model-based analyses further illustrate that SCI can be recovered as a stable system-level property even under latent-truth uncertainty, whereas individual thresholds may be weakly identified. Together, these results clarify the epistemic meaning of κ and motivate a decomposition of inter-rater reliability into outcome-level agreement and process-level alignment. By linking classical agreement statistics to an explicit generative model of judgment, the Strategic Convergence framework advances reliability assessment from description toward explanation.
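The prevalence sensitivity of Cohen's κ described above can be reproduced with a small analytic sketch. It assumes an equal-variance Gaussian SDT model in which two raters share the same threshold and receive independent noisy evidence given the true state; these modeling choices, the function name `kappa_two_raters`, and the parameter values are illustrative assumptions, and the sketch does not compute the paper's SCI.

```python
from statistics import NormalDist


def kappa_two_raters(prevalence, d_prime, criterion=0.0):
    """Cohen's kappa for two raters with identical SDT thresholds.

    Assumed model (not taken from the paper): evidence is N(+d'/2, 1) for
    signal cases and N(-d'/2, 1) for noise cases, each rater says "yes" when
    their own evidence exceeds `criterion`, and the raters' evidence is
    conditionally independent given the true state.
    """
    phi = NormalDist().cdf
    hit = phi(d_prime / 2 - criterion)           # P(yes | signal)
    false_alarm = phi(-d_prime / 2 - criterion)  # P(yes | noise)

    p, q = prevalence, 1 - prevalence
    # Observed agreement: both say yes or both say no, averaged over states.
    p_observed = (p * (hit**2 + (1 - hit) ** 2)
                  + q * (false_alarm**2 + (1 - false_alarm) ** 2))
    # Chance agreement from the (shared) marginal probability of "yes".
    p_yes = p * hit + q * false_alarm
    p_chance = p_yes**2 + (1 - p_yes) ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Same discriminability and same thresholds; only prevalence changes.
kappa_balanced = kappa_two_raters(prevalence=0.50, d_prime=2.0)
kappa_rare = kappa_two_raters(prevalence=0.05, d_prime=2.0)
```

Under these assumptions κ is noticeably lower in the rare-prevalence case even though both raters apply exactly the same decision rule, which is the kind of behavior the abstract attributes to κ.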
- Research Article
- 10.1177/00131644251395590
- Jan 13, 2026
- Educational and psychological measurement
- Jana Welling + 2 more
Educational large-scale assessments provide information on ability differences between groups, informing policies and shaping educational decisions. However, some of these differences might partly reflect variations in test-taking motivation rather than in actual abilities. Existing approaches for mitigating the distorting effects of rapid guessing focus mainly on point estimates of abilities, although research questions often refer to latent variables. The present study seeks to (a) determine the bias introduced by rapid guessing in group comparisons based on plausible value estimates and (b) introduce and evaluate different approaches to handling rapid guessing in the estimation of plausible values. In a simulation study, four models were compared: (1) a baseline model that did not account for rapid guessing, (2) a person-level model that incorporated rapid guessing as a respondent characteristic in the background model, (3) a response-level model that filtered responses with item response times lower than a predetermined threshold, and (4) a combined model that merged the person- and response-level approaches. Results show that the response-level and combined models performed best, whereas accounting for rapid guessing on the person level did not suffice. An empirical example using data from a German large-scale assessment (N = 478) demonstrates the applicability of all approaches in practice. Recommendations for future research are given to improve ability estimation.
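The response-level approach (filtering responses whose response time falls below a threshold) can be sketched as follows. The function name, the 3-second threshold, the example data, and the choice to recode filtered responses as missing (`None`) are assumptions for illustration; the abstract does not state how filtered responses are treated or which threshold was used.

```python
def filter_rapid_guesses(responses, response_times, threshold_seconds):
    """Recode responses given faster than the threshold as missing (None).

    Treating filtered responses as missing is an illustrative assumption.
    """
    if len(responses) != len(response_times):
        raise ValueError("responses and response_times must have equal length")
    return [
        resp if rt >= threshold_seconds else None
        for resp, rt in zip(responses, response_times)
    ]


# Hypothetical item scores (1 = correct, 0 = incorrect) and times in seconds.
scores = [1, 0, 1, 1]
times = [12.4, 1.1, 8.0, 2.9]

filtered = filter_rapid_guesses(scores, times, threshold_seconds=3.0)

# A simple person-level indicator: the share of responses flagged as rapid.
rapid_guess_rate = sum(r is None for r in filtered) / len(filtered)
```

The last line hints at how a person-level summary could be derived from the same flags, though how the study's person-level model actually codes rapid guessing is not described in the abstract.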
- Research Article
- 10.1177/00131644251399588
- Jan 4, 2026
- Educational and psychological measurement
- Jasper Bogaert + 2 more
Researchers in the behavioral, educational, and social sciences often aim to analyze relationships among latent variables. Structural equation modeling (SEM) is widely regarded as the gold standard for this purpose. A straightforward alternative for estimating the structural model parameters is uncorrected factor score regression (UFSR), where factor scores are first computed and then employed in regression or path analysis. Unfortunately, the most commonly used factor scores (i.e., Regression and Bartlett factor scores) may yield biased estimates and invalid inferences when using this approach. In recent years, factor score regression (FSR) has enjoyed several methodological advancements to address this inconsistency. Despite these advancements, the use of FSR with correlation-preserving factor scores, here termed consistent factor score regression (cFSR), has received limited attention. In this paper, we revisit cFSR and compare its advantages and disadvantages relative to other recent FSR and SEM methods. We conducted an extensive simulation study comparing cFSR with other estimation approaches, assessing their performance in terms of convergence rate, bias, efficiency, and type I error rate. The findings indicate that cFSR outperforms UFSR while maintaining the conceptual simplicity of UFSR. We encourage behavioral, educational, and social science researchers to avoid UFSR and adopt cFSR as an alternative to SEM.
- Research Article
- 10.1177/00131644251405406
- Jan 2, 2026
- Educational and psychological measurement
- Tianpeng Zheng + 3 more
A critical methodological challenge in standard setting arises in small-sample, high-dimensional contexts where the number of items substantially exceeds the number of examinees. Under such conditions, conventional data-driven methods that rely on parametric models (e.g., item response theory) often become unstable or fail due to unreliable parameter estimation. This study investigates two families of data-driven methods: information-theoretic and unsupervised clustering, offering a potential solution to this challenge. Using a Monte Carlo simulation, we systematically evaluate 15 such methods to establish an evidence-based framework for practice. The simulation manipulated five factors, including sample size, the item-to-examinee ratio, mixture proportions, item quality, and ability separation. Method performance was evaluated using multiple criteria, including Relative Error, Classification Accuracy, Sensitivity, Specificity, and Youden's Index. Results indicated that no single method is universally superior; the optimal choice depends on the examinee mixture proportion. Specifically, the information-theoretic method QIR (quantile information ratio) excelled in scenarios with a dominant non-competent group, where high specificity was critical. Conversely, in highly selective contexts with balanced proficiency groups, the clustering methods CHI (Calinski-Harabasz index) and sum of squared error (SSE) demonstrated the highest classification effectiveness. Bayesian kernel density estimation (BKDE), however, consistently performed as a robust, balanced method across conditions. These findings provide practitioners with a clear decision framework for selecting a defensible, data-driven standard-setting method when traditional approaches are infeasible.
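Several of the evaluation criteria named above have standard definitions based on a 2x2 classification table, sketched below. The function name and the example counts are hypothetical, and treating "competent" as the positive class is an assumption; Relative Error of the cut score is omitted because its exact definition is not given in the abstract.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard classification criteria from confusion-matrix counts.

    tp/fn: truly competent examinees classified as competent / not competent.
    tn/fp: truly non-competent examinees classified as not competent / competent.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    youden_index = sensitivity + specificity - 1
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden_index": youden_index,
    }


# Hypothetical counts for one simulated condition.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Youden's Index weights sensitivity and specificity equally, which is one reason a method can look strong on it while still being a poor fit when one type of misclassification matters more.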
- Research Article
- 10.1177/00131644251393444
- Dec 24, 2025
- Educational and psychological measurement
- Minho Lee + 1 more
Residual-based fit statistics, which compare observed item statistics (e.g., proportions) with model-implied probabilities, are widely used to evaluate model fit, item fit, and local dependence in item response theory (IRT) models. Despite the prevalence of item non-responses in empirical studies, their impact on these statistics has not been systematically examined. Existing software packages often apply heuristic treatments (e.g., listwise or pairwise deletion), which can distort fit statistics because missing data further inflate discrepancies between observed and expected proportions. This study evaluates the appropriateness of such treatments through extensive simulation. Results show that deletion methods degrade the accuracy of fit testing: fit indices are inflated under both null and power conditions, with the bias worsening as missingness increases. In addition, the impact of missing data exceeds that of model misspecification. Practical recommendations and alternative methods are discussed to guide applied researchers.
- Research Article
- 10.1177/00131644251396543
- Dec 20, 2025
- Educational and psychological measurement
- Dimiter M Dimitrov + 1 more
Building on previous research on the conditional reliability of number-correct test scores, conditioned on levels of the logit scale in item response theory, this article deals with the conditional reliability of classical-type weighted scores conditioned on latent levels of a bounded scale. This is done in the framework of the D-scoring method of measurement (D-scale, bounded from 0 to 1). Along with the conditional reliability of weighted D-scores, conditioned on latent levels of the D-scale, some additional measures of precision are presented: conditional standard error, conditional signal-to-noise ratio, and marginal reliability. R syntax for all computations is also provided.
- Research Article
- 10.1177/00131644251397428
- Dec 19, 2025
- Educational and psychological measurement
- Ji Yoon Jung + 2 more
Conventional cross-country scoring reliability in international large-scale assessments often depends on double scoring, which typically involves relatively small samples of multilingual responses. To extend the reach of reliability estimation, this study introduces the Linguistic-integrated Reliability Audit (LiRA), a novel method that measures scoring reliability using an entire dataset in a large-scale, multilingual context. LiRA automatically generates a second score for each response by analyzing its semantic alignment within a neighborhood of similar responses, then applies a weighted majority voting to determine a consensus score. Results demonstrate that LiRA provides a more comprehensive and systematic estimation of scoring reliability at the item, country, and language levels, while preserving the fundamental concepts of traditional reliability.
- Research Article
- 10.1177/00131644251401097
- Dec 19, 2025
- Educational and psychological measurement
- Jin Liu + 3 more
Applied researchers often encounter situations where certain item response categories receive very few endorsements, resulting in sparse data. Collapsing categories may mitigate sparsity by increasing cell counts, yet the methodological consequences of this practice remain insufficiently explored. The current study examined the effects of response collapsing in Likert-type scale data through a simulation study under the confirmatory factor analysis model. Sparse response categories were collapsed to determine the impact on fit indices (i.e., chi-square, comparative fit index [CFI], Tucker-Lewis index [TLI], root mean square error of approximation [RMSEA], and standardized root mean square residual [SRMR]). Findings indicate that category collapsing has a significant impact when sparsity is severe, leading to reduced model rejections in both correctly specified and misspecified models. In addition, different fit indices exhibited varying sensitivities to data collapsing. Specifically, RMSEA was recommended for the correctly identified model, and TLI with a cut-off value of .95 was recommended for the misspecified models. The empirical analysis was aligned with the simulation results. These results provide valuable insights for researchers confronted with sparse data in applied measurement contexts.
- Research Article
- 10.1177/00131644251393203
- Dec 19, 2025
- Educational and psychological measurement
- Francis Huang
Although cluster-robust standard errors (CRSEs) are commonly used to account for violations of the independence of observations found in nested data, an underappreciated issue is that there are several instances in which CRSEs can fail to maintain the nominal Type I error rate. These situations (e.g., analyzing data with imbalanced cluster sizes) can readily be found in various types of education-related datasets and are important to consider when conducting statistical inference tests with cluster-level predictors. Using a Monte Carlo simulation, we investigated these conditions and tested alternative estimators and degrees of freedom (df) adjustments to assess how well they could ameliorate the issues related to the use of the traditional CRSE (CR1) estimator using both continuous and dichotomous predictors. Findings showed that the bias-reduced linearization estimator (CR2) and the jackknife estimator (CR3), together with df adjustments, were generally effective at maintaining Type I error rates for most of the conditions tested. Results also indicated that CR1, when paired with df based on the effective cluster size, was acceptable. We emphasize the importance of clearly describing the nested data structure, as the characteristics of the dataset can influence Type I error rates when using CRSEs.
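For orientation, the cluster-robust sandwich form that CR1 builds on can be written as below for an ordinary least squares fit. The notation is an assumption introduced here (G clusters, X_g the design matrix rows for cluster g, and ê_g the corresponding residual vector), and the G/(G-1) factor is one common convention for CR1; some software applies a different small-sample factor, and the abstract does not say which one the study used.

```latex
\widehat{V}_{\mathrm{CR0}}
  = (X^\top X)^{-1}
    \Bigl( \sum_{g=1}^{G} X_g^\top \hat{e}_g \hat{e}_g^\top X_g \Bigr)
    (X^\top X)^{-1},
\qquad
\widehat{V}_{\mathrm{CR1}} = \frac{G}{G-1}\, \widehat{V}_{\mathrm{CR0}}
```

Because the correction depends only on the number of clusters, it does not react to imbalance in cluster sizes, which is consistent with the abstract's interest in alternative estimators and df adjustments.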