Estimating Conditional Standard Errors of Measurement for Tests Composed of Testlets
The primary purposes of this study were to investigate both the appropriateness and the implications of incorporating a testlet definition into the estimation of the conditional standard error of measurement (SEM) for tests composed of testlets. When individual items are used as the fundamental measurement unit with tests composed of testlets, the assumptions required by measurement modeling are violated, but those assumptions are satisfied when testlet scores are used as the measurement unit. Therefore, item-based estimation methods probably introduce some magnitude of bias in the estimates of conditional SEMs for tests composed of testlets. The five conditional SEM estimation methods used in this study were classified as either item-based or testlet-based methods. In general, the item-based methods provide lower estimates of the conditional SEM along the score scale than do the testlet-based methods.
- Research Article
5
- 10.2466/pr0.98.1.237-252
- Feb 1, 2006
- Psychological Reports
A specific recommendation of the 1999 Standards for Educational and Psychological Testing by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education is that test publishers report estimates of the conditional standard error of measurement (SEM). Procedures for calculating the conditional (score-level) SEM based on raw scores are well documented; however, few procedures have been developed for estimating the conditional SEM of subtest or composite scale scores resulting from a nonlinear transformation. Item response theory provided the psychometric foundation to derive the conditional standard errors of measurement and confidence intervals for composite scores on the Wechsler Preschool and Primary Scale of Intelligence-Third Edition.
- Research Article
13
- 10.1111/j.1745-3984.2000.tb01078.x
- Jun 1, 2000
- Journal of Educational Measurement
The primary purpose of this study was to investigate the appropriateness and implications of incorporating a testlet definition into the estimation procedures of the conditional standard error of measurement (SEM) for tests composed of testlets. Another purpose was to investigate the bias in estimates of the conditional SEM when using item‐based methods instead of testlet‐based methods. Several item‐based and testlet‐based estimation methods were proposed and compared. In general, item‐based estimation methods underestimated the conditional SEM for tests composed of testlets, and the magnitude of this negative bias increased as the degree of conditional dependence among items within testlets increased. However, an item‐based method using a generalizability theory model provided good estimates of the conditional SEM under mild violation of the assumptions for measurement modeling. Under moderate or somewhat severe violation, testlet‐based methods with item response models provided good estimates.
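The contrast between item-based and testlet-based estimation can be illustrated with a minimal sketch. It assumes a generalizability-theory style estimator in which the conditional error variance of a total score is computed from the within-person variance of the measurement units; this is one possible estimator, not necessarily any of the methods compared in the study. The function name `conditional_sem` and the response pattern are hypothetical.

```python
from math import sqrt

def conditional_sem(unit_scores):
    """Conditional SEM of a total score, treating each entry as one unit.

    Uses k * sum((u - mean)^2) / (k - 1), the within-person variance of the
    units scaled up to the total-score metric (an assumed estimator).
    """
    k = len(unit_scores)
    if k < 2:
        raise ValueError("need at least two measurement units")
    mean = sum(unit_scores) / k
    ss = sum((u - mean) ** 2 for u in unit_scores)
    return sqrt(k * ss / (k - 1))

# Hypothetical examinee: 4 testlets of 3 dichotomous items each, with
# responses that are perfectly dependent within each testlet.
responses = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]

# Item-based: the 12 items are the measurement units.
item_units = [x for testlet in responses for x in testlet]
# Testlet-based: the 4 testlet scores are the measurement units.
testlet_units = [sum(testlet) for testlet in responses]

item_based_sem = conditional_sem(item_units)
testlet_based_sem = conditional_sem(testlet_units)
print(round(item_based_sem, 3), round(testlet_based_sem, 3))
```

For this deliberately extreme pattern the item-based value is smaller than the testlet-based value, which is the direction of bias the abstract describes.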
- Research Article
- 10.1177/00131644261420391
- Feb 25, 2026
- Educational and psychological measurement
This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across the same four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.
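As a rough illustration of the information-based idea, the sketch below computes a CSEM as the reciprocal square root of test information for a single fixed form. It assumes a two-parameter logistic model without the 1.7 scaling constant and hypothetical item parameters; it does not reproduce the paper's treatment of MST routing, which is not described in the abstract.

```python
from math import exp, sqrt

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def information(theta, items):
    """Fisher information at theta: sum of a^2 * P * (1 - P) over items."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total

def csem(theta, items):
    """Asymptotic CSEM of the maximum likelihood ability estimate."""
    return 1.0 / sqrt(information(theta, items))

# Hypothetical forms: identical items (a = 1, b = 0) at two test lengths.
short_form = [(1.0, 0.0)] * 4
long_form = [(1.0, 0.0)] * 16

csem_short = csem(0.0, short_form)
csem_long = csem(0.0, long_form)
print(csem_short, csem_long)
```

Because the formula is asymptotic, it is expected to track simulation results more closely for longer tests, which is consistent with the convergence reported in the abstract.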
- Research Article
79
- 10.1111/j.1745-3984.1992.tb00378.x
- Dec 1, 1992
- Journal of Educational Measurement
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw‐to‐scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw‐to‐scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
- Research Article
2
- 10.1016/s0191-8869(00)00065-9
- Jan 31, 2001
- Personality and Individual Differences
Conditional standard error of measurement and personality scale scores: an investigation of classical test theory estimates with four MMPI scales
- Research Article
62
- 10.1111/j.1745-3984.1996.tb00485.x
- Jun 1, 1996
- Journal of Educational Measurement
An IRT method for estimating conditional standard errors of measurement of scale scores is presented, where scale scores are nonlinear transformations of number‐correct scores. The standard errors account for measurement error that is introduced due to rounding scale scores to integers. Procedures for estimating the average conditional standard error of measurement for scale scores and reliability of scale scores are also described. An illustration of the use of the methodology is presented, and the results from the IRT method are compared to the results from a previously developed method that is based on strong true‐score theory.
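One way to sketch this kind of computation is shown below. It assumes the conditional distribution of number-correct scores is built with the Lord-Wingersky recursion from model-implied item probabilities at a fixed ability level, and that the conditional SEM is the standard deviation of the rounded scale scores under that distribution. The function names, item probabilities, and conversion are hypothetical, and the details may differ from the article's procedure.

```python
from math import sqrt

def number_correct_dist(probs):
    """Lord-Wingersky recursion: distribution of number-correct scores.

    probs[i] is the probability of a correct response to item i at a
    fixed ability level.
    """
    dist = [1.0]  # with zero items, a score of 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, f in enumerate(dist):
            new[x] += f * (1.0 - p)   # item answered incorrectly
            new[x + 1] += f * p       # item answered correctly
        dist = new
    return dist

def scale_score_csem(probs, conversion):
    """SD of integer-rounded scale scores given the item probabilities."""
    dist = number_correct_dist(probs)
    # Python's round() rounds halves to the nearest even integer.
    scaled = [round(conversion(x)) for x in range(len(dist))]
    mean = sum(f * s for f, s in zip(dist, scaled))
    return sqrt(sum(f * (s - mean) ** 2 for f, s in zip(dist, scaled)))

# Hypothetical two-item test with a linear raw-to-scale conversion.
dist = number_correct_dist([0.5, 0.5])
sem = scale_score_csem([0.5, 0.5], lambda x: 10 * x)
print(dist, sem)
```

Applying the rounding inside the variance computation is what lets the resulting standard error reflect error introduced by reporting integer scale scores.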
- Book Chapter
1
- 10.1002/9781118445112.stat06376
- Sep 29, 2014
- Wiley StatsRef: Statistics Reference Online
Errors of measurement for test scores generally are viewed as random and unpredictable. Conditional standard errors of measurement (SEMs) index the amount of error in the measurement process and are required to be reported for tests. Methods of estimating conditional SEMs, based on both classical test theory and generalizability theory, are described in this entry. These methods are based on differing assumptions, but the conditional SEM estimates obtained from these methods are commonly highly related for educational achievement tests. Conditional SEMs vary by score level. Conditional SEMs typically have high values for extreme scores and have low values for middle scores on the raw score scale. However, a nonlinear transformation of the raw scores to scale scores can lead to conditional SEMs that have other patterns across score levels.
- Research Article
24
- 10.1177/0013164499591001
- Feb 1, 1999
- Educational and Psychological Measurement
This article develops two procedures for estimating individual-level conditional standard errors of measurement (SEMs) for scale scores, assuming tests consist of dichotomously scored items. Using binomial model assumptions, one procedure provides a scale-score analogue of Lord’s raw-score SEM. Assuming a compound binomial model, another procedure provides a scale-score analogue of Feldt’s raw-score SEM. These two procedures are compared to a polynomial procedure and a procedure developed by Feldt and Qualls using data from the Iowa Tests of Basic Skills.
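The binomial idea can be sketched as follows. The raw-score function uses Lord's well-known binomial formula; the scale-score function takes the standard deviation of the converted scores under a binomial distribution with success probability x/n. The n/(n - 1) correction is applied here so that an identity conversion reproduces Lord's formula; whether the article uses exactly this correction is an assumption, and the function names and conversion tables are hypothetical.

```python
from math import comb, sqrt

def lord_raw_sem(x, n):
    """Lord's binomial conditional SEM for raw score x on n items."""
    return sqrt(x * (n - x) / (n - 1))

def binomial_scale_sem(x, n, scale):
    """Scale-score analogue: scale[j] is the scale score for raw score j."""
    p = x / n
    probs = [comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
    mean = sum(pr * s for pr, s in zip(probs, scale))
    var = sum(pr * (s - mean) ** 2 for pr, s in zip(probs, scale))
    # Assumed correction so the identity conversion matches Lord's formula.
    return sqrt(var * n / (n - 1))

n_items = 10
identity = list(range(n_items + 1))           # scale score = raw score
doubled = [2 * j for j in range(n_items + 1)] # hypothetical linear scale

raw_sem = lord_raw_sem(4, n_items)
identity_sem = binomial_scale_sem(4, n_items, identity)
doubled_sem = binomial_scale_sem(4, n_items, doubled)
print(raw_sem, identity_sem, doubled_sem)
```

Passing a nonlinear conversion table as `scale` is where a scale-score SEM would start to depart from a simple rescaling of the raw-score SEM.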
- Book Chapter
- 10.1007/978-3-030-43469-4_17
- Jan 1, 2020
Conditional standard error of measurement (CSEM) indicates the level of measurement precision at a particular true score or ability level. Having a constant CSEM across all scores not only simplifies score interpretation and score reporting, but also contributes to the fairness of testing. This paper compares two fundamentally different approaches to achieving constant CSEMs: CSEM stabilizing scale transformations and computer adaptive tests (CATs) with fixed-precision stopping rules. Through conceptual comparison and empirical illustration, this study shows that the two approaches produce score scales that are nonlinearly related to each other, with each achieving the goal of equalizing the CSEMs on its own scale. Procedures for equalizing the CSEMs of a CAT that is not designed to have fixed precision are provided, and implications for transitioning from linear tests with equal CSEMs to CATs are also discussed.
- Research Article
5
- 10.1177/1534508420937801
- Jul 2, 2020
- Assessment for Effective Intervention
Curriculum-based measurement of oral reading fluency (CBM-R) is widely used across the country as a quick measure of reading proficiency that also serves as a good predictor of comprehension and overall reading achievement, but it has several practical and technical inadequacies, including a large standard error of measurement (SEM). Reducing the SEM of CBM-R scores has positive implications for educators using these measures to screen or monitor student growth. The purpose of this study was to compare the SEM of traditional CBM-R words correct per minute (WCPM) fluency scores and the conditional SEM (CSEM) of model-based WCPM estimates, particularly for students with or at risk of poor reading outcomes. We found (a) the average CSEM for the model-based WCPM estimates was substantially smaller than the reported SEMs of traditional CBM-R systems, especially for scores at/below the 25th percentile, and (b) a large proportion (84%) of sample scores, and an even larger proportion of scores at/below the 25th percentile (about 99%) had a smaller CSEM than the reported SEMs of traditional CBM-R systems.
- Research Article
17
- 10.1177/0146621606294206
- Jul 1, 2007
- Applied Psychological Measurement
This article introduces a multinomial error model, which models an examinee's test scores obtained over repeated measurements of an assessment that consists of polytomously scored items. A compound multinomial error model is also introduced for situations in which items are stratified according to content categories and/or prespecified numbers of score points. The multinomial and compound multinomial models are implemented in this article for estimating conditional standard errors of measurement and reliability. The applicability of the multinomial and compound multinomial models is illustrated with two real data examples. The first example considers test scores obtained from polytomous items only, and the second example contains test scores from a mixture of dichotomous and polytomous items. A simulation study is conducted to examine the amount and pattern of bias in the estimated conditional standard errors of measurement. Index terms: multinomial model, compound multinomial model, standard errors of measurement, reliability, scale scores
- Research Article
4
- 10.1007/s10459-011-9309-0
- Oct 1, 2011
- Advances in Health Sciences Education
Examinees who initially fail and later repeat an SP-based clinical skills exam typically exhibit large score gains on their second attempt, suggesting the possibility that examinees were not well measured on one of those attempts. This study evaluates score precision for examinees who repeated an SP-based clinical skills test administered as part of the US Medical Licensing Examination sequence. Generalizability theory was used as the basis for computing conditional standard errors of measurement (SEM) for individual examinees. Conditional SEMs were computed for approximately 60,000 single-take examinees and 5,000 repeat examinees who completed the Step 2 Clinical Skills Examination® between 2007 and 2009. The study focused exclusively on ratings of communication and interpersonal skills. Conditional SEMs for single-take and repeat examinees were nearly indistinguishable across most of the score scale. US graduates and IMGs were measured with equal levels of precision at all score levels, as were examinees with differing levels of skill speaking English. There was no evidence that examinees with the largest score changes were measured poorly on either their first or second attempt. The large score increases for repeat examinees on this SP-based exam probably cannot be attributed to unexpectedly large errors of measurement.
- Research Article
3
- 10.1007/s10459-010-9221-z
- Feb 3, 2010
- Advances in Health Sciences Education
The use of standardized patients to assess communication skills is now an essential part of assessing a physician's readiness for practice. To improve the reliability of communication scores, it has become increasingly common in recent years to use statistical models to adjust ratings provided by standardized patients. This study employed ordinary least squares regression to adjust ratings, and then used generalizability theory to evaluate the impact of these adjustments on score reliability and the overall standard error of measurement. In addition, conditional standard errors of measurement were computed for both observed and adjusted scores to determine whether the improvements in measurement precision were uniform across the score distribution. Results indicated that measurement was generally less precise for communication ratings toward the lower end of the score distribution; and the improvement in measurement precision afforded by statistical modeling varied slightly across the score distribution such that the most improvement occurred in the upper-middle range of the score scale. Possible reasons for these patterns in measurement precision are discussed, as are the limitations of the statistical models used for adjusting performance ratings.
- Book Chapter
1
- 10.1002/0470013192.bsa196
- Apr 15, 2005
- Encyclopedia of Statistics in Behavioral Science
Errors of measurement for test scores generally are viewed as random and unpredictable. Conditional standard errors of measurement (SEMs) index the amount of error in the measurement process and are required to be reported for tests. Methods of estimating conditional SEMs, based on both classical test theory and generalizability theory, are described in this entry. These methods are based on differing assumptions, but the conditional SEM estimates obtained from these methods are commonly highly related for educational achievement tests. Conditional SEMs vary by score level. Conditional SEMs typically have high values for extreme scores and have low values for middle scores on the raw score scale. However, a nonlinear transformation of the raw scores to scale scores can lead to conditional SEMs that have other patterns across score levels.
- Research Article
2
- 10.1002/j.2333-8504.1991.tb01392.x
- Jun 1, 1991
- ETS Research Report Series
A series of computer programs were written for computing the conditional standard errors of measurement (CSEM) for both rights‐scored and formula‐scored tests based on a method suggested in Lord (1984), commonly known as Lord's Method IV or the compound binomial method. These programs estimate the conditional standard errors of measurement for both raw and scaled scores, average results for two or more forms, and compute form‐to‐form difference statistics for pairs of forms. Conditional standard errors of measurement, averages, and differences have been computed for the verbal, quantitative, and analytical raw and converted scores for eight forms of the GRE General Test and for two forms each of 15 GRE Subject Tests. The Standards for Educational and Psychological Testing (Committee of AERA, APA, & NCME to Develop Standards, 1985) recommends that test publishers provide estimates of the standard error of measurement at a number of widely spaced score levels. The CSEM data produced in this study have been made available to three programs that use GRE scores, along with other criteria, for awarding fellowships. These data also have been made available for use in GRE program publications and in correspondence.