Estimation of Conditional Standard Errors of Measurement for MLE Scores in MST.
This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across the same four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.
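As a rough illustration of what "information-based" means here, the sketch below computes the CSEM of an MLE ability estimate as the reciprocal square root of the test information along a single fixed path of items. It assumes a two-parameter logistic (2PL) model and hypothetical item parameters. It is not the paper's MST procedure, which must also account for routing between modules. It only shows the building block and why the CSEM shrinks as items are added.

```python
import math

def item_information_2pl(theta, a, b):
    """Fisher information of one 2PL item at ability theta (assumed model)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def csem_mle(theta, items):
    """Asymptotic CSEM of the MLE: 1 / sqrt(test information at theta).

    `items` is a list of (a, b) pairs; the values used below are hypothetical.
    """
    info = sum(item_information_2pl(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical forms built from identical items (a = 1, b = 0).
short_form = [(1.0, 0.0)] * 4
long_form = [(1.0, 0.0)] * 16
csem_short = csem_mle(0.0, short_form)
csem_long = csem_mle(0.0, long_form)
```

Because the information-based value is an asymptotic approximation, it is expected to agree with simulation only as test length grows, which is consistent with the convergence the abstract reports.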
- Research Article
5
- 10.2466/pr0.98.1.237-252
- Feb 1, 2006
- Psychological Reports
A specific recommendation of the 1999 Standards for Educational and Psychological Testing by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education is that test publishers report estimates of the conditional standard error of measurement (SEM). Procedures for calculating the conditional (score-level) SEM based on raw scores are well documented; however, few procedures have been developed for estimating the conditional SEM of subtest or composite scale scores resulting from a nonlinear transformation. Item response theory provided the psychometric foundation to derive the conditional standard errors of measurement and confidence intervals for composite scores on the Wechsler Preschool and Primary Scale of Intelligence-Third Edition.
- Research Article
12
- 10.1207/s15324818ame1302_3
- Apr 1, 2000
- Applied Measurement in Education
The primary purposes of this study were to investigate both the appropriateness and the implications of incorporating a testlet definition into the estimation of the conditional standard error of measurement (SEM) for tests composed of testlets. When individual items are used as the fundamental measurement unit with tests composed of testlets, the assumptions required by measurement modeling are violated, but those assumptions are satisfied when testlet scores are used as the measurement unit. Therefore, item-based estimation methods probably introduce some magnitude of bias in the estimates of conditional SEMs for tests composed of testlets. The five conditional SEM estimation methods used in this study were classified as either item-based or testlet-based methods. In general, the item-based methods provide lower estimates of the conditional SEM along the score scale than do the testlet-based methods.
- Research Article
62
- 10.1111/j.1745-3984.1996.tb00485.x
- Jun 1, 1996
- Journal of Educational Measurement
An IRT method for estimating conditional standard errors of measurement of scale scores is presented, where scale scores are nonlinear transformations of number‐correct scores. The standard errors account for measurement error that is introduced due to rounding scale scores to integers. Procedures for estimating the average conditional standard error of measurement for scale scores and reliability of scale scores are also described. An illustration of the use of the methodology is presented, and the results from the IRT method are compared to the results from a previously developed method that is based on strong true‐score theory.
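A minimal sketch of the general idea, under stated assumptions: for a fixed ability level, the probabilities of answering each item correctly give the conditional distribution of the number-correct score (computed here with the Lord-Wingersky recursion), and the conditional SEM of the scale score is the standard deviation of the converted, rounded scores under that distribution. The function names and the conversion table are hypothetical and are not taken from the article.

```python
import math

def number_correct_distribution(probs):
    """Lord-Wingersky recursion: P(X = x | theta) for x = 0..n.

    `probs` holds each item's probability of a correct response at one theta.
    """
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            new[x] += mass * (1.0 - p)   # item answered incorrectly
            new[x + 1] += mass * p       # item answered correctly
        dist = new
    return dist

def scale_score_csem(probs, conversion):
    """SD of scale scores at one theta; conversion[x] is the (rounded)
    scale score assigned to raw score x (hypothetical table)."""
    dist = number_correct_distribution(probs)
    mean = sum(f * conversion[x] for x, f in enumerate(dist))
    var = sum(f * (conversion[x] - mean) ** 2 for x, f in enumerate(dist))
    return math.sqrt(var)

# Two items, each answered correctly with probability 0.5 at this theta.
dist_example = number_correct_distribution([0.5, 0.5])
csem_identity = scale_score_csem([0.5, 0.5], [0, 1, 2])
csem_flat = scale_score_csem([0.5, 0.5], [10, 10, 10])
```

Because the conversion table enters the variance directly, a nonlinear or coarsely rounded conversion changes the conditional SEM even when the raw-score distribution is unchanged.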
- Research Article
79
- 10.1111/j.1745-3984.1992.tb00378.x
- Dec 1, 1992
- Journal of Educational Measurement
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw‐to‐scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw‐to‐scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
- Book Chapter
- 10.1007/978-3-030-43469-4_17
- Jan 1, 2020
Conditional standard error of measurement (CSEM) indicates the level of measurement precision at a particular true score or ability level. Having a constant CSEM across all scores not only simplifies score interpretation and score reporting, but also contributes to the fairness of testing. This paper compares two fundamentally different approaches to achieving constant CSEMs: CSEM stabilizing scale transformations and computer adaptive tests (CATs) with fixed-precision stopping rules. Through conceptual comparison and empirical illustration, this study shows that the two approaches produce score scales that are nonlinearly related to each other, and that each achieves the goal of equalizing the CSEMs on its own scale. Procedures for equalizing the CSEMs of a CAT that is not designed to have fixed precision are provided, and implications for transitioning from linear tests with equal CSEMs to CATs are also discussed.
- Research Article
2
- 10.1016/s0191-8869(00)00065-9
- Jan 31, 2001
- Personality and Individual Differences
Conditional standard error of measurement and personality scale scores: an investigation of classical test theory estimates with four MMPI scales
- Research Article
24
- 10.1177/0013164499591001
- Feb 1, 1999
- Educational and Psychological Measurement
This article develops two procedures for estimating individual-level conditional standard errors of measurement (SEMs) for scale scores, assuming tests consist of dichotomously scored items. Using binomial model assumptions, one procedure provides a scale-score analogue of Lord’s raw-score SEM. Assuming a compound binomial model, another procedure provides a scale-score analogue of Feldt’s raw-score SEM. These two procedures are compared to a polynomial procedure and a procedure developed by Feldt and Qualls using data from the Iowa Tests of Basic Skills.
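For reference, the raw-score quantity that the first procedure builds on can be sketched as follows. Under the binomial error model, Lord's conditional SEM for a raw score x on an n-item test is sqrt(x(n - x)/(n - 1)). The function name is hypothetical, and the sketch covers only the raw-score formula, not the scale-score analogues the article develops.

```python
import math

def lord_binomial_sem(raw_score, n_items):
    """Lord's binomial-model SEM for a raw score on n dichotomous items."""
    if not 0 <= raw_score <= n_items:
        raise ValueError("raw_score must lie between 0 and n_items")
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))

# Conditional SEM at every raw score of a hypothetical 40-item test.
sems = [lord_binomial_sem(x, 40) for x in range(41)]
```

The values are zero at the two extreme scores and largest at the midpoint, which is the pattern a nonlinear raw-to-scale transformation can then reshape.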
- Research Article
5
- 10.1080/00131725.2011.602467
- Oct 1, 2011
- The Educational Forum
Test score validity takes center stage in the debate over the use of high school exit exams. Scant literature addresses the amount of conditional standard error of measurement (CSEM) present in individual student results on high school exit exams. The purpose of this study is to fill a void in the literature and add a national review of the CSEM, including data on the amount of CSEM present in high school exit exams results. Individual student results from each of the 23 exit exams contained a CSEM ranging from 3.29 to 39 scale-score points. Nearly one-fourth of the state education agencies did not report the CSEM for the individual student results.
- Book Chapter
1
- 10.1002/9781118445112.stat06376
- Sep 29, 2014
- Wiley StatsRef: Statistics Reference Online
Errors of measurement for test scores generally are viewed as random and unpredictable. Conditional standard errors of measurement (SEMs) index the amount of error in the measurement process and are required to be reported for tests. Methods of estimating conditional SEMs, based on both classical test theory and generalizability theory, are described in this entry. These methods are based on differing assumptions, but the conditional SEM estimates obtained from these methods are commonly highly related for educational achievement tests. Conditional SEMs vary by score level. Conditional SEMs typically have low values for extreme scores and have high values for middle scores on the raw score scale. However, a nonlinear transformation of the raw scores to scale scores can lead to conditional SEMs that have other patterns across score levels.
- Dissertation
- 10.17077/etd.i71hp3na
- Sep 5, 2018
The importance of measuring and monitoring educational achievement longitudinally has led to a proliferation of growth models. The Student Growth Percentile (SGP) is one score metric which helps to make inferences about current relative student status given prior test scores. The major purpose of this study was to provide two Conditional Standard Errors of Measurement (CSEM) estimation approaches for individual-level SGPs with theoretical justifications and empirical elaborations of them. Estimation approaches were developed under two commonly used paradigms: Classical Test Theory (CTT) and Item Response Theory (IRT). Within each paradigm, measurement error was conceptualized as variability of individual-level test scores across hypothetical repeated measurement using parallel test forms. Under the CTT paradigm, the measurement errors were assumed to be distributed as a binomial model. Under the IRT paradigm, they were assumed to be distributed as a compound binomial model. In addition to CSEMs, the purpose of this study was to develop procedures for constructing individual-level SGP confidence intervals and for estimating reliability. The proposed methods were demonstrated using data for a large-scale assessment of mathematics achievement from Grades 3 to 4. For example, pertinent tables and graphs including outcome statistics showed that the mean and median values of CSEMs for individual SGPs were sizable, the length of tests influenced actual values of CSEM for SGP, but there were small differences in CSEM values between the two types of conversion relationships. The CSEM values on the SGP scale by each academic peer group were distributed in an arch shape. Also, compared to the SGP reliabilities under CTT, those under IRT had similar reliability coefficients in the three tests. The results of these demonstrations were used to evaluate measurement errors in the context of practical and policy implications of SGP use.
In the final chapter, the practical use of SGPs and important considerations regarding measurement issues are provided. Further research related to SGPs using different subjects or grade levels, or simulation studies on the effectiveness of the developed methodologies, are also discussed.
- Research Article
18
- 10.1002/j.2333-8504.2007.tb02046.x
- Jun 1, 2007
- ETS Research Report Series
Traditionally, the fixed‐length linear paper‐and‐pencil (P&P) mode of administration has been the standard method of test delivery. With the advancement of technology, however, the popularity of administering tests using adaptive methods like computerized adaptive testing (CAT) and multistage testing (MST) has grown in the field of measurement in both theory and practice. In practice, several standardized tests have sections that include only set‐based items. To date, there is no study in the literature that compares these testing procedures when a test is completely set‐based under various item response theory (IRT) models. This study investigates the measurement precision of MST compared to CAT and compared to P&P tests for the one‐, two‐, and three‐parameter logistic (1‐, 2‐, and 3PL) models when the test is completely set‐based. Results showed that MST performed better for the 2‐ and 3PL models than an equivalent‐length P&P test in terms of reliability and conditional standard error of measurement. In addition, findings showed that MST performed better for the 1‐ and 2PL models than for an equivalent‐length CAT test. For the 3PL model, MST and CAT performed about the same.
- Research Article
20
- 10.1111/j.1745-3984.1990.tb00743.x
- Sep 1, 1990
- Journal of Educational Measurement
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM‐like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.
- Research Article
20
- 10.1080/15305058.2011.617476
- Jan 1, 2012
- International Journal of Testing
Composite scores are often formed from test scores on educational achievement test batteries to provide a single index of achievement over two or more content areas or two or more item types on that test. Composite scores are subject to measurement error, and as with scores on individual tests, the amount of error variability typically depends on the individual's score level. A procedure is presented for estimating conditional standard errors of measurement and reliability for composite scores. Item response theory (IRT) models are used as the psychometric foundation for developing the procedure. First, a general procedure is described, followed by specific applications for estimating conditional standard errors of measurement of the ACT Assessment composite and a weighted summed score on a mathematics test. General issues in estimating conditional standard errors of measurement for composite scores are discussed.
- Dissertation
2
- 10.17077/etd.ij5duu44
- Jan 3, 2012
The equity properties can be used to assess the quality of an equating. The degree to which expected scores conditional on ability are similar between test forms is referred to as first-order equity. Second-order equity is the degree to which conditional standard errors of measurement are similar between test forms after equating. The purpose of this dissertation was to investigate the use of a multidimensional IRT framework for assessing first- and second-order equity of mixed format tests.
Both real and simulated data were used for assessing the equity properties for mixed-format tests. Using real data from three Advanced Placement (AP) exams, five different equating methods were compared in their preservation of first- and second-order equity. Frequency estimation, chained equipercentile, unidimensional IRT true score, unidimensional IRT observed score, and multidimensional IRT observed score equating methods were used. Both a unidimensional IRT framework and a multidimensional IRT framework were used to assess the equity properties. Two simulation studies were also conducted. The first investigated the accuracy of expected scores and conditional standard errors of measurement as tests became increasingly multidimensional using both a unidimensional IRT framework and multidimensional IRT framework. In the second simulation study, the five different equating methods were compared in their ability to preserve first- and second-order equity as tests became more multidimensional and as differences in group ability increased.
Results from the real data analyses indicated that the performance of the equating methods based on first- and second-order equity varied depending on which framework was used to assess equity and which test was used. Some tests showed similar preservation of equity for both frameworks while others differed greatly in their assessment of equity.
Results from the first simulation study showed that, when the correlation between abilities was high, estimates of expected scores had lower mean squared error values under the unidimensional framework than under the multidimensional framework. The multidimensional IRT framework had lower mean squared error values for conditional standard errors of measurement when the correlation between abilities was less than .95. In the second simulation study, chained equating performed better than frequency estimation for first-order equity. Frequency estimation better preserved second-order equity compared to the chained method. As tests became more multidimensional or as group differences increased, the multidimensional IRT observed score equating method tended to perform better than the other methods.
- Research Article
- 10.1002/j.2333-8504.1993.tb01559.x
- Dec 1, 1993
- ETS Research Report Series
This paper presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including the formation of a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate its effective test length in terms of discrete items. The true‐score distribution is estimated by fitting a four‐parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. The agreement between classifications on two alternate forms is estimated by assuming conditional independence, given the true score. An evaluation of the method showed that the estimates of the percent of test‐takers correctly classified and the percent consistently classified were within one percentage point of the actual values in most cases. Although the estimated effective test length and the estimates of the conditional standard error of measurement are sensitive to changes in the specified minimum and maximum possible scores, the estimates of the decision accuracy and decision consistency statistics are not.
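The effective-test-length step can be sketched as below. The formula is the form commonly given for this method and is treated here as an assumption to be checked against the report: it takes the score mean, variance, reliability, and the minimum and maximum possible scores, and returns the number of dichotomous items that would yield the same reliability. The function names and numeric inputs are hypothetical.

```python
def effective_test_length(mean, variance, reliability, min_score, max_score):
    """Assumed form: ((mu - min)(max - mu) - r * var) / (var * (1 - r))."""
    numerator = (mean - min_score) * (max_score - mean) - reliability * variance
    denominator = variance * (1.0 - reliability)
    return numerator / denominator

def kr21(n_items, mean, variance):
    """KR-21 reliability for an n-item number-correct score (used as a check)."""
    return (n_items / (n_items - 1)) * (1.0 - mean * (n_items - mean) / (n_items * variance))

# Consistency check with hypothetical summary statistics: when the reliability
# of a 40-item number-correct score is its KR-21 value, the effective test
# length should come back as the actual number of items.
r = kr21(40, 25.0, 36.0)
n_eff = effective_test_length(25.0, 36.0, r, 0.0, 40.0)
```

The minimum and maximum possible scores enter the numerator directly, which is one way to see why the abstract reports the effective test length as sensitive to how those bounds are specified.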