- Research Article
- 10.1177/00131644261430762
- Apr 12, 2026
- Educational and Psychological Measurement
- Nathaniel M Voss + 3 more
Advances in large language models can provide opportunities to evaluate the characteristics of scales prior to data collection. In this study, we explore whether item text can be used to predict a scale's psychometric properties. Specifically, we examine whether clustering consensus (i.e., the frequency with which items are grouped with other items from the same underlying factor across multiple clustering algorithms) and a cosine similarity metric (i.e., the semantic similarity of items to other items from the same factor) can be used to predict exploratory factor analysis (EFA) factor loadings. Across six scales with varying sample sizes and numbers of factors and items, we found that both the cosine similarity and ensemble clustering consensus methods predicted factor loading values. While the methods share some conceptual and empirical overlap, and results vary by scale, the ensemble clustering approach explains incremental variance above and beyond cosine similarity in predicting factor loadings. Using both methods in conjunction can be a useful way to identify problematic items prior to data collection and help researchers develop better scales from the outset, thereby potentially saving time and resources and increasing the likelihood of developing sound measures.
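The cosine similarity metric described in this abstract can be sketched in a few lines of Python; the item names and embedding vectors below are hypothetical stand-ins for the output of a sentence-embedding model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical item embeddings (in practice, produced by an embedding model)
embeddings = {
    "item1": [0.9, 0.1, 0.0],
    "item2": [0.8, 0.2, 0.1],  # same intended factor as item1
    "item3": [0.1, 0.9, 0.2],  # different factor
}

def same_factor_similarity(item, factor_items, embeddings):
    """Mean cosine similarity of `item` to the other items of its factor."""
    others = [i for i in factor_items if i != item]
    return sum(cosine(embeddings[item], embeddings[i]) for i in others) / len(others)
```

A low value of `same_factor_similarity` would flag an item as semantically adrift from its intended factor before any data are collected.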
- Research Article
- 10.1177/00131644261435119
- Apr 9, 2026
- Educational and Psychological Measurement
- Martijn Schoenmakers + 3 more
Extreme response style (ERS), the tendency of participants to endorse the extreme categories of an item partly independent of item content, has repeatedly been found to decrease the validity of Likert-type scale results. For this reason, many IRT models have been developed to detect and correct for ERS. Despite the substantial literature on ERS and its modeling, several important questions remain. To date, there is no clear estimate of how often ERS occurs in practice across a variety of scales and populations. In addition, there is little guidance on what item parameters for ERS models are commonly found in empirical data, although this information is crucial for future methodological studies utilizing ERS models. Finally, only limited information is available on which ERS models tend to fit the data best. The current study addresses these three issues by analyzing data from the Programme for International Student Assessment using a generalized partial credit model, several multidimensional nominal response models (MNRMs), and several IRTree models. Results indicate an extremely high prevalence of ERS across scales, populations, and time points. Item parameters for future methodological studies are presented, and a general preference for IRTree models over MNRMs is found in many datasets. Implications for future studies are discussed, and recommendations for practice are made.
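As a rough illustration of what ERS looks like at the descriptive level (the study itself fits IRT-based models, not this simple index), the proportion of extreme endorsements per respondent can be computed directly:

```python
def extreme_response_rate(responses, low=1, high=5):
    """Proportion of a respondent's answers in the two extreme categories."""
    return sum(1 for r in responses if r in (low, high)) / len(responses)

# hypothetical 5-point Likert responses for one respondent
rate = extreme_response_rate([1, 5, 5, 3, 2, 5, 1, 4])  # 5 of 8 extreme -> 0.625
```

Model-based approaches such as the MNRM and IRTree families go further by separating this tendency from the substantive trait, which a raw rate cannot do.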
- Research Article
- 10.1177/00131644261422169
- Mar 18, 2026
- Educational and Psychological Measurement
- Timo Seitz + 1 more
When personality assessments are employed in high-stakes contexts, there is a risk that test-takers provide overly positive descriptions of themselves. This response bias is known as faking and has often been addressed in latent variable models through an additional dimension capturing each test-taker's degree of faking. Such models typically assume a homogeneous response strategy for all test-takers, with substantive traits and faking jointly influencing responses to all items. In this article, we present a latent response mixture item response theory (IRT) model of faking that accounts for changes in test-takers' response strategies over the course of the assessment. The model translates theoretical considerations about test-taker behavior into different model components for item responses and corresponding item-level response times (RTs), thereby making it possible to account for, identify, and investigate different faking-related response strategies at the person-by-item level. In a parameter recovery study, we found that the model parameters can be estimated well under realistic conditions. We also applied the model to an empirical dataset (N = 1,824) from a job application context, showcasing its utility in real high-stakes assessment data. We conclude by discussing the role of the model for psychological measurement as well as substantive research.
- Research Article
- 10.1177/00131644261426972
- Mar 17, 2026
- Educational and Psychological Measurement
- Joshua B Gilbert + 4 more
The use of process data in psychometrics, such as response time (RT), has generally focused on the relationship between speed and accuracy. The potential relationships between RT and item discrimination remain less explored. In this study, we propose a model for simultaneously estimating the relationships between RT and item discrimination at the person, item, and person-by-item (residual) levels and illustrate our approach through an item-level meta-analysis of 40 empirical data sets comprising 1.84 million item responses. We find no evidence of average differences in item discrimination between items of different time intensity or persons of different average RT, while residual RT strongly and negatively predicts item discrimination (pooled coef. = -.27% per 1% difference in RT, SE = .04, = .17). While heterogeneity is high, we find little evidence of moderation by overall data set characteristics. Flexible generalized additive models show that the relationship between residual RT and item discrimination is generally curvilinear, with discrimination maximized just below average RT and minimized at the extremes. Our results suggest that RT data can provide insights into the measurement properties of educational and psychological assessments, but that the relationships between RT and item discrimination are highly variable.
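One common way to operationalize residual RT at the person-by-item level is to double-center the log response times, removing person speed and item time intensity; this sketch assumes a complete persons × items matrix and is only an approximation of the meta-analytic model the study actually fits:

```python
import math

def residual_log_rt(rt_matrix):
    """Person-by-item residual log RT: subtract person and item means
    (double-centering), leaving each person's relative slowness on each item."""
    log_rt = [[math.log(t) for t in row] for row in rt_matrix]
    n_p, n_i = len(log_rt), len(log_rt[0])
    grand = sum(map(sum, log_rt)) / (n_p * n_i)
    person = [sum(row) / n_i for row in log_rt]          # person speed
    item = [sum(col) / n_p for col in zip(*log_rt)]      # item time intensity
    return [[log_rt[p][i] - person[p] - item[i] + grand
             for i in range(n_i)] for p in range(n_p)]

# hypothetical RTs in seconds for 3 persons x 2 items
res = residual_log_rt([[10.0, 30.0], [8.0, 20.0], [15.0, 45.0]])
```

By construction the residuals sum to zero within each person and each item, so any remaining variation is the person-by-item component the abstract relates to item discrimination.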
- Research Article
- 10.1177/00131644251408818
- Mar 13, 2026
- Educational and Psychological Measurement
- Oskar Engels + 2 more
In longitudinal assessments, tests are frequently used to estimate trends over time. However, when item parameters lack invariance, time-point comparisons can be distorted, necessitating appropriate statistical methods for accurate estimation. This study compares trend estimates under the two-parameter logistic (2PL) model with item parameter drift (IPD) across five trend-estimation approaches for two time points. First, concurrent calibration jointly estimates item parameters across time points. Second, fixed calibration estimates item parameters at one time point and fixes them at the other. Third, robust linking uses the Haberman and Haebara methods with robust loss functions. Fourth, non-invariant items are detected using likelihood-ratio tests or the root mean square deviation statistic with fixed or data-driven cutoffs, and trend estimates are then recomputed using only the detected invariant items under partial invariance. Fifth, regularized estimation under a smooth Bayesian information criterion (SBIC) shrinks small or null IPD effects toward zero while estimating all others as nonzero. Bias and relative root mean square error (RMSE) were evaluated for the mean and SD at the second time point. An empirical example applying the trend-estimation approaches to synthetic longitudinal reading data is provided. The results indicate that regularized estimation with the SBIC performed best across conditions, maintaining low bias and RMSE, followed by the robust linking methods. Specifically, Haberman linking with a robust loss function showed superior performance under unbalanced IPD, outperforming the partial invariance approaches. Concurrent and fixed calibration showed the poorest trend recovery under unbalanced IPD conditions.
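The shrinkage idea behind the regularized approach can be illustrated with a lasso-style soft threshold; note that the study's SBIC penalty is a smooth criterion rather than the hard rule sketched here, and the drift values below are made up:

```python
def soft_threshold(drift, lam):
    """Lasso-style shrinkage: drift effects within lam of zero become exactly
    zero; larger effects are kept but pulled toward zero by lam."""
    if abs(drift) <= lam:
        return 0.0
    return drift - lam if drift > 0 else drift + lam

# hypothetical item parameter drift estimates between two time points
drifts = [0.02, -0.01, 0.45, -0.30, 0.005]
shrunk = [soft_threshold(d, 0.05) for d in drifts]  # small drifts -> 0.0
```

Items whose drift is shrunk to exactly zero are effectively treated as invariant anchors, which is what lets the approach outperform hard detect-then-exclude strategies under partial invariance.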
- Research Article
- 10.1177/00131644261419028
- Mar 11, 2026
- Educational and Psychological Measurement
- Karl Schweizer + 2 more
The capability of confirmatory factor analysis to discriminate among common systematic variation due to attribute, item-position, and wording effects was investigated using the congeneric and tau-equivalent models. The simulated data, generated according to four approaches, included gradually increasing amounts of item-position or wording effect variation while the amount of attribute variation was kept constant. The congeneric model always indicated good model fit independently of the type and amount of additional common systematic variation; that is, there was no discrimination. In applications of the tau-equivalent model, increasing the item-position or wording effect variation led to a change from good to bad model fit; that is, there was negative discrimination. In contrast, the additionally considered two-factor tau model discriminated positively. Given these results, we recommend pre-screening data for method effects.
- Research Article
- 10.1177/00131644261420391
- Feb 25, 2026
- Educational and Psychological Measurement
- Yuanyuan J Stirn + 1 more
This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed with the analytic method against those obtained from simulation across four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compares the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.
- Research Article
- 10.1177/00131644261419426
- Feb 23, 2026
- Educational and Psychological Measurement
- Santeri Holopainen + 3 more
Response time threshold methods (RTTMs) are widely used to identify rapid-guessing behavior (RG) in low-stakes assessments, yet they face two key challenges: (a) inevitable misclassifications due to the overlapping response time distributions of engaged and disengaged responses, and (b) a lack of agreement on which method to use under varying conditions. This simulation study evaluated five RTTMs. Item responses and response times were generated from either a one-component model without RG or a two-component mixture model with RG in the population. Distribution, item, and person parameters were varied. Results showed that when the population contained RG, the mixture lognormal distribution-based method (MLN) was the most robust approach and estimated precise thresholds closest to the time points at which misclassification rates were minimized, even when bimodality was difficult to detect. The cumulative proportion method (CUMP) was less robust; when it did set a threshold, it was accurate, though less precise. In addition, when the population did not include RG, CUMP was the only method to set thresholds for a notable proportion of cases. The methods were generally more conservative than liberal, though the mixture response time quantile method (MRTQ) was neither. The results are discussed in light of prior RG research and the methods' characteristics, and future directions are suggested. Ultimately, for practical settings, we recommend a six-step process for RG identification that combines a mixture modeling approach (MLN or MRTQ) with the CUMP method.
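The intuition behind the MLN-style threshold (the point minimizing misclassification) can be sketched by locating where the weighted guessing and engaged log-RT densities cross; the mixture parameters below are hypothetical, and a real application would first estimate them from data:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mln_threshold(pi_guess, mu_g, sd_g, mu_e, sd_e, steps=1000):
    """Grid-search the log-RT point between the component means where the
    weighted guessing and engaged densities cross (misclassification minimal)."""
    grid = [mu_g + k * (mu_e - mu_g) / steps for k in range(steps + 1)]
    return min(grid, key=lambda x: abs(pi_guess * normal_pdf(x, mu_g, sd_g)
                                       - (1 - pi_guess) * normal_pdf(x, mu_e, sd_e)))

# hypothetical mixture: 15% rapid guesses around 2 s, engaged responses around 20 s
threshold = mln_threshold(0.15, math.log(2), 0.4, math.log(20), 0.6)
```

The threshold is applied on the seconds scale via `math.exp(threshold)`; responses faster than it are flagged as rapid guesses.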
- Research Article
- 10.1177/00131644261418138
- Feb 16, 2026
- Educational and Psychological Measurement
- Mingfeng Xue + 3 more
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates intra-LLM and inter-LLM scoring consistency for five LLMs (Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that aggregates information from different LLMs is proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging the complementary strengths of different models.
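The majority-voting strategy in finding (e) reduces to taking the modal score across LLM raters; a minimal sketch (the tie-breaking rule here, first-seen wins, is an arbitrary choice rather than the study's):

```python
from collections import Counter

def majority_vote(scores):
    """Modal score across LLM raters; Counter breaks ties by first occurrence."""
    return Counter(scores).most_common(1)[0][0]

# hypothetical scores from five LLMs for one constructed response
final = majority_vote([2, 3, 3, 2, 3])  # -> 3
```

Voting helps only when the raters' errors are not perfectly correlated, which is why the abstract frames it as leveraging the models' complementary strengths.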
- Research Article
- 10.1177/00131644261417643
- Feb 16, 2026
- Educational and Psychological Measurement
- Irene Gianeselli
Inter-rater reliability is commonly assessed using chance-corrected agreement coefficients such as Cohen's κ, which summarize concordance among categorical judgments without modeling the inferential processes that generate them. As a result, κ is sensitive to prevalence imbalance, task difficulty, and heterogeneity in decision criteria and is often misinterpreted as a proxy for diagnostic accuracy or rater competence. This paper reframes inter-rater reliability within a signal detection-theoretic (SDT) framework in which categorical judgments arise from comparisons between latent continuous evidence and rater-specific decision thresholds. Within this generative model, κ can be interpreted as a bounded transformation of discrete strategic variance (i.e., the observable consequence of dispersion in latent decision criteria) rather than as a direct measure of epistemic alignment. To make this structure explicit, we introduce the Strategic Convergence Index (SCI), a normalized functional summarizing convergence in rater decision thresholds under an SDT generative process. SCI is not proposed as a standalone agreement coefficient but as a model-implied quantity whose interpretation depends on explicit assumptions about evidence distributions and decision rules. Monte Carlo simulations show that κ varies systematically with prevalence and perceptual discriminability even when decision-policy alignment is held constant, whereas SCI selectively tracks epistemic alignment and remains invariant to these factors. Supplementary model-based analyses further illustrate that SCI can be recovered as a stable system-level property even under latent-truth uncertainty, whereas individual thresholds may be weakly identified. Together, these results clarify the epistemic meaning of κ and motivate a decomposition of inter-rater reliability into outcome-level agreement and process-level alignment. 
By linking classical agreement statistics to an explicit generative model of judgment, the Strategic Convergence framework advances reliability assessment from description toward explanation.
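The prevalence sensitivity of κ noted above is easy to reproduce with toy data: two rater pairs with identical 90% raw agreement can yield κ = .8 or κ = 0 depending on category prevalence. The ratings below are invented for illustration:

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical judgments over the same items."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n                    # observed agreement
    cats = set(r1) | set(r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)   # chance agreement
    return (po - pe) / (1 - pe)

# Same 90% raw agreement, different prevalence:
balanced = cohens_kappa([1] * 5 + [0] * 5, [1] * 5 + [0] * 4 + [1])  # kappa = 0.8
skewed = cohens_kappa([1] * 9 + [0], [1] * 10)                       # kappa = 0.0
```

Under the SDT framing above, the two cases differ in chance agreement, not in how aligned the raters' decision policies are, which is exactly the confound the SCI is designed to separate out.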