Articles published on Conditional Standard Errors Of Measurement
24 Search results
- Research Article
- 10.1017/psy.2026.10122
- May 19, 2026
- Psychometrika
- L Andries Van Der Ark
User-Friendly Software and Estimated Conditional Standard Errors of Measurement. A Commentary on Pfadt et al.
- Research Article
- 10.1177/00131644261420391
- Feb 25, 2026
- Educational and Psychological Measurement
- Yuanyuan J Stirn + 1 more
This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across the same four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.
- Research Article
- 10.1111/jedm.70029
- Jan 21, 2026
- Journal of Educational Measurement
- Won‐Chan Lee + 2 more
Advancements in artificial intelligence (AI) have brought significant changes to testing practices, including the emergence of randomly parallel testing (RPT), in which examinees receive different but psychometrically similar sets of items generated from templates or AI‐based systems. This paper presents a generalizability theory (GT) framework for estimating conditional standard errors of measurement (CSEMs) and related reliability indices, with a particular focus on design structures commonly encountered in RPT within domain‐referenced testing contexts. The proposed framework supports the evaluation of score precision across a variety of operational designs, including crossed, nested, and multivariate configurations. Several illustrative examples are provided to demonstrate the methodology in practical settings. The paper also addresses key psychometric and interpretive challenges associated with RPT and outlines promising directions for future research.
- Research Article
3
- 10.1177/01466216231209749
- Oct 19, 2023
- Applied Psychological Measurement
- Adam E Wyse
This study introduces two new statistics for measuring the score comparability of computerized adaptive tests (CATs) based on comparing conditional standard errors of measurement (CSEMs) for examinees that achieved the same scale scores. One statistic is designed to evaluate score comparability of alternate CAT forms for individual scale scores, while the other statistic is designed to evaluate the overall score comparability of alternate CAT forms. The effectiveness of the new statistics is illustrated using data from grade 3 through 8 reading and math CATs. Results suggest that both CATs demonstrated reasonably high levels of score comparability, that score comparability was less at very high or low scores where few students score, and that using random samples with fewer students per grade did not have a big impact on score comparability. Results also suggested that score comparability was sometimes higher when the bottom 20% of scorers were used to calculate overall score comparability compared to all students. Additional discussion related to applying the statistics in different contexts is provided.
- Research Article
4
- 10.1111/jedm.12140
- Jun 1, 2017
- Journal of Educational Measurement
- Tim Moses + 1 more
The focus of this article is on scale score transformations that can be used to stabilize conditional standard errors of measurement (CSEMs). Three transformations for stabilizing the estimated CSEMs are reviewed, including the traditional arcsine transformation, a recently developed general variance stabilization transformation, and a new method proposed in this article involving cubic transformations. Two examples are provided and the three scale score transformations are compared in terms of how well they stabilize CSEMs estimated from compound binomial and item response theory (IRT) models. Advantages of the cubic transformation are demonstrated with respect to CSEM stabilization and other scaling criteria (e.g., scale score distributions that are more symmetric).
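To illustrate why an arcsine transformation stabilizes CSEMs, the sketch below compares the conditional standard deviation of the raw score with that of an arcsine-transformed score under a simple binomial error model. It uses the basic form arcsin(√(x/n)) rather than any specific variant reviewed in the article, and the test length, true proportions, and function names are assumptions chosen for illustration. By the delta method the transformed score's SEM is roughly 1/(2√n) regardless of the true proportion.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k correct out of n under a binomial error model."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def raw_score(x, n):
    """Identity transformation: the number-correct score itself."""
    return float(x)

def arcsine_score(x, n):
    """Basic arcsine transformation of the proportion correct (assumed form)."""
    return math.asin(math.sqrt(x / n))

def conditional_sd(transform, n, p):
    """Exact SD of the transformed score for true proportion correct p."""
    values = [transform(x, n) for x in range(n + 1)]
    probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, probs))
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, probs))
    return math.sqrt(var)

n_items = 40  # hypothetical test length
for p in (0.3, 0.5, 0.7):
    print(p, conditional_sd(raw_score, n_items, p),
          conditional_sd(arcsine_score, n_items, p))
```

In this setup the raw-score SEM changes noticeably between p = 0.3 and p = 0.5, while the transformed SEM stays close to 1/(2√40) at all three values.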
- Discussion
4
- 10.1097/nnr.0000000000000078
- Mar 1, 2015
- Nursing Research
- Byron Gajewski + 2 more
Sijtsma and van der Ark present a broad set of models and methods for reliability estimation, and their discussion of similarities and differences provides clear information for nurse researchers to move forward in their instrument development projects. In particular, we applaud the authors' clear exposition of the factor analytic model and its utility for providing a framework for unifying reliability and validity. However, we do not want to be constrained only to the point estimates. We also need to ascertain the uncertainty in the point estimate, usually in the form of a 95% confidence interval or, as Bayesians call it, a credible interval. Another issue not discussed by Sijtsma and van der Ark is conditional standard errors of measurement along the score scale measuring latent traits or true scores. In our response, practical tools for estimating intervals and a brief discussion of conditional standard errors of measurement are presented.
- Research Article
5
- 10.1111/j.1745-3984.2012.00180.x
- Dec 1, 2012
- Journal of Educational Measurement
- Mark R Raymond + 2 more
Although a few studies report sizable score gains for examinees who repeat performance-based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single-take examinees and 4,030 repeat examinees who completed a 6-hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication-interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single-take and multiple-take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low-scoring examinees regardless of retest status. In addition, multiple-take examinees exhibited less score consistency across the skill domains on their first attempt, but their scores became more consistent on their second attempt. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple-take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single-take examinees. The findings support the validity of inferences based on scores from the second attempt.
- Research Article
20
- 10.1080/15305058.2011.617476
- Jan 1, 2012
- International Journal of Testing
- Michael J Kolen + 2 more
Composite scores are often formed from test scores on educational achievement test batteries to provide a single index of achievement over two or more content areas or two or more item types on that test. Composite scores are subject to measurement error, and as with scores on individual tests, the amount of error variability typically depends on the individual's score level. A procedure is presented for estimating conditional standard errors of measurement and reliability for composite scores. Item response theory (IRT) models are used as the psychometric foundation for developing the procedure. First, a general procedure is described, followed by specific applications for estimating conditional standard errors of measurement of the ACT Assessment composite and a weighted summed score on a mathematics test. General issues in estimating conditional standard errors of measurement for composite scores are discussed.
- Research Article
4
- 10.1007/s10459-011-9309-0
- Oct 1, 2011
- Advances in Health Sciences Education
- Mark R Raymond + 2 more
Examinees who initially fail and later repeat an SP-based clinical skills exam typically exhibit large score gains on their second attempt, suggesting the possibility that examinees were not well measured on one of those attempts. This study evaluates score precision for examinees who repeated an SP-based clinical skills test administered as part of the US Medical Licensing Examination sequence. Generalizability theory was used as the basis for computing conditional standard errors of measurement (SEM) for individual examinees. Conditional SEMs were computed for approximately 60,000 single-take examinees and 5,000 repeat examinees who completed the Step 2 Clinical Skills Examination® between 2007 and 2009. The study focused exclusively on ratings of communication and interpersonal skills. Conditional SEMs for single-take and repeat examinees were nearly indistinguishable across most of the score scale. US graduates and IMGs were measured with equal levels of precision at all score levels, as were examinees with differing levels of skill speaking English. There was no evidence that examinees with the largest score changes were measured poorly on either their first or second attempt. The large score increases for repeat examinees on this SP-based exam probably cannot be attributed to unexpectedly large errors of measurement.
- Research Article
12
- 10.1111/j.1745-3992.2011.00201.x
- Jun 1, 2011
- Educational Measurement: Issues and Practice
- Michael J Kolen + 1 more
This paper illustrates that the psychometric properties of scores and scales that are used with mixed-format educational tests can impact the use and interpretation of the scores that are reported to examinees. Psychometric properties that include reliability and conditional standard errors of measurement are considered in this paper. The focus is on mixed-format tests in situations for which raw scores are integer-weighted sums of item scores. Four associated real-data examples include (a) effects of weights associated with each item type on reliability, (b) comparison of psychometric properties of different scale scores, (c) evaluation of the equity property of equating, and (d) comparison of the use of unidimensional and multidimensional procedures for evaluating psychometric properties. Throughout the paper, and especially in the conclusion section, the examples are related to issues associated with test interpretation and test use.
- Research Article
41
- 10.1111/j.1745-3992.2010.00179.x
- Sep 1, 2010
- Educational Measurement: Issues and Practice
- Michael J Kolen + 1 more
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non‐Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real‐data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth‐related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non‐Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
- Research Article
3
- 10.1007/s10459-010-9221-z
- Feb 3, 2010
- Advances in Health Sciences Education
- Mark R Raymond + 2 more
The use of standardized patients to assess communication skills is now an essential part of assessing a physician's readiness for practice. To improve the reliability of communication scores, it has become increasingly common in recent years to use statistical models to adjust ratings provided by standardized patients. This study employed ordinary least squares regression to adjust ratings, and then used generalizability theory to evaluate the impact of these adjustments on score reliability and the overall standard error of measurement. In addition, conditional standard errors of measurement were computed for both observed and adjusted scores to determine whether the improvements in measurement precision were uniform across the score distribution. Results indicated that measurement was generally less precise for communication ratings toward the lower end of the score distribution; and the improvement in measurement precision afforded by statistical modeling varied slightly across the score distribution such that the most improvement occurred in the upper-middle range of the score scale. Possible reasons for these patterns in measurement precision are discussed, as are the limitations of the statistical models used for adjusting performance ratings.
- Research Article
7
- 10.1097/acm.0b013e3181b37d01
- Oct 1, 2009
- Academic Medicine
- Mark R Raymond + 3 more
Previous research has shown that ratings of English proficiency on the United States Medical Licensing Examination Clinical Skills Examination are highly reliable. However, the score distributions for native and nonnative speakers of English are sufficiently different to suggest that reliability should be investigated separately for each group. Generalizability theory was used to obtain reliability indices separately for native and nonnative speakers of English (N = 29,084). Conditional standard errors of measurement were also obtained for both groups to evaluate measurement precision for each group at specific score levels. Overall indices of reliability (phi) exceeded 0.90 for both native and nonnative speakers, and both groups were measured with nearly equal precision throughout the score distribution. However, measurement precision decreased at lower levels of proficiency for all examinees. The results of this and future studies may be helpful in understanding and minimizing sources of measurement error at particular regions of the score distribution.
- Research Article
17
- 10.1177/0146621606294206
- Jul 1, 2007
- Applied Psychological Measurement
- Won-Chan Lee
This article introduces a multinomial error model, which models an examinee's test scores obtained over repeated measurements of an assessment that consists of polytomously scored items. A compound multinomial error model is also introduced for situations in which items are stratified according to content categories and/or prespecified numbers of score points. The multinomial and compound multinomial models are implemented in this article for estimating conditional standard errors of measurement and reliability. The applicability of the multinomial and compound multinomial models is illustrated with two real data examples. The first example considers test scores obtained from polytomous items only, and the second example contains test scores from a mixture of dichotomous and polytomous items. A simulation study is conducted to examine the amount and pattern of bias in the estimated conditional standard errors of measurement. Index terms: multinomial model, compound multinomial model, standard errors of measurement, reliability, scale scores
- Research Article
7
- 10.3102/10769986031003261
- Sep 1, 2006
- Journal of Educational and Behavioral Statistics
- Won-Chan Lee + 2 more
Assuming errors of measurement are distributed binomially, this article reviews various procedures for constructing an interval for an individual’s true number-correct score; presents two general interval estimation procedures for an individual’s true scale score (i.e., normal approximation and endpoints conversion methods); compares various interval estimation procedures through a computer simulation study; and provides some practical guidelines for use of the interval estimation procedures. To examine the effects of different types of scale scores, three nonlinearly transformed scale scores are employed. The conditional confidence intervals using conditional standard errors of measurement are recommended over the traditional confidence intervals using the overall standard error of measurement. For raw scores, the score confidence intervals, in general, tend to provide actual coverage probabilities that are closest to the nominal level. Results for scale score intervals seem to favor the endpoints conversion method using the true-score conversions over the normal approximation approach.
- Research Article
5
- 10.2466/pr0.98.1.237-252
- Feb 1, 2006
- Psychological Reports
- Larry R Price + 4 more
A specific recommendation of the 1999 Standards for Educational and Psychological Testing by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education is that test publishers report estimates of the conditional standard error of measurement (SEM). Procedures for calculating the conditional (score-level) SEM based on raw scores are well documented; however, few procedures have been developed for estimating the conditional SEM of subtest or composite scale scores resulting from a nonlinear transformation. Item response theory provided the psychometric foundation to derive the conditional standard errors of measurement and confidence intervals for composite scores on the Wechsler Preschool and Primary Scale of Intelligence-Third Edition.
- Research Article
44
- 10.1111/j.1745-3984.2003.tb01097.x
- Mar 1, 2003
- Journal of Educational Measurement
- Shun‐Wen Chang + 1 more
This study compared the properties of five methods of item exposure control within the purview of estimating examinees’ abilities in a computerized adaptive testing (CAT) context. Each exposure control algorithm was incorporated into the item selection procedure and the adaptive testing progressed based on the CAT design established for this study. The merits and shortcomings of these strategies were considered under different item pool sizes and different desired maximum exposure rates and were evaluated in light of the observed maximum exposure rates, the test overlap rates, and the conditional standard errors of measurement. Each method had its advantages and disadvantages, but no one possessed all of the desired characteristics. There was a clear and logical trade‐off between item exposure control and measurement precision. The Stocking and Lewis conditional multinomial procedure and, to a slightly lesser extent, the Davey and Parshall method seemed to be the most promising considering all of the factors that this study addressed.
- Research Article
12
- 10.1207/s15324818ame1302_3
- Apr 1, 2000
- Applied Measurement in Education
- Guemin Lee
The primary purposes of this study were to investigate both the appropriateness and the implications of incorporating a testlet definition into the estimation of the conditional standard error of measurement (SEM) for tests composed of testlets. When individual items are used as the fundamental measurement unit with tests composed of testlets, the assumptions required by measurement modeling are violated, but those assumptions are satisfied when testlet scores are used as the measurement unit. Therefore, item-based estimation methods probably introduce some magnitude of bias in the estimates of conditional SEMs for tests composed of testlets. The five conditional SEM estimation methods used in this study were classified as either item-based or testlet-based methods. In general, the item-based methods provide lower estimates of the conditional SEM along the score scale than do the testlet-based methods.
- Research Article
42
- 10.1111/j.1745-3984.2000.tb01073.x
- Mar 1, 2000
- Journal of Educational Measurement
- Won‐Chan Lee + 2 more
This paper describes four procedures previously developed for estimating conditional standard errors of measurement for scale scores: the IRT procedure (Kolen, Zeng, & Hanson, 1996), the binomial procedure (Brennan & Lee, 1999), the compound binomial procedure (Brennan & Lee, 1999), and the Feldt‐Qualls procedure (1998). These four procedures are based on different underlying assumptions. The IRT procedure is based on the unidimensional IRT model assumptions. The binomial and compound binomial procedures employ, as the distribution of errors, the binomial model and compound binomial model, respectively. By contrast, the Feldt‐Qualls procedure does not depend on a particular psychometric model, and it simply translates any estimated conditional raw‐score SEM to a conditional scale‐score SEM. These procedures are compared in a simulation study, which involves two‐dimensional data sets. The presence of two category dimensions reflects a violation of the IRT unidimensionality assumption. The relative accuracy of these procedures for estimating conditional scale‐score standard errors of measurement is evaluated under various circumstances. The effects of three different types of transformations of raw scores are investigated including developmental standard scores, grade equivalents, and percentile ranks. All the procedures discussed appear viable. A general recommendation is made that test users select a procedure based on various factors such as the type of scale score of concern, characteristics of the test, assumptions involved in the estimation procedure, and feasibility and practicability of the estimation procedure.
- Research Article
24
- 10.1177/0013164499591001
- Feb 1, 1999
- Educational and Psychological Measurement
- Robert L Brennan + 1 more
This article develops two procedures for estimating individual-level conditional standard errors of measurement (SEMs) for scale scores, assuming tests consist of dichotomously scored items. Using binomial model assumptions, one procedure provides a scale-score analogue of Lord’s raw-score SEM. Assuming a compound binomial model, another procedure provides a scale-score analogue of Feldt’s raw-score SEM. These two procedures are compared to a polynomial procedure and a procedure developed by Feldt and Qualls using data from the Iowa Tests of Basic Skills.
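For readers unfamiliar with the raw-score quantity this abstract builds on, the sketch below computes Lord's binomial-model SEM for a number-correct score, √(x(n − x)/(n − 1)). It covers only the raw-score case; the scale-score analogues developed in the article are not reproduced. The function name and the example test length are assumptions made for illustration.

```python
import math

def lord_binomial_sem(raw_score: int, n_items: int) -> float:
    """Lord's binomial-model SEM for a number-correct score x on n items."""
    if n_items < 2:
        raise ValueError("n_items must be at least 2")
    if not 0 <= raw_score <= n_items:
        raise ValueError("raw_score must lie between 0 and n_items")
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))

# Hypothetical 40-item test: the SEM peaks mid-scale and is 0 at the extremes.
for x in (0, 10, 20, 30, 40):
    print(x, round(lord_binomial_sem(x, 40), 3))
```

The estimate depends only on the examinee's own score and the test length, which is what makes it a conditional rather than an overall standard error.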