On the Unreliability of Test-Retest Reliability.
The Test-Retest Coefficient (TRC) is a central metric of reliability in Classical Test Theory and modern psychological assessment. Originally developed by early 20th-century psychometricians, it relies on the assumptions of fixed (i.e., perfectly stable) true scores and independent error scores. However, these assumptions are rarely, if ever, tested, despite the fact that their violation can introduce substantial biases. This article explores the foundations of these assumptions and examines the performance of the TRC under varying conditions, including different sample sizes, degrees of true score stability, and degrees of error score dependence. Simulations show that decreasing true score stability biases TRC estimates downward, leading to underestimation of reliability, while error score dependence can inflate TRC values, making unreliable measures appear reliable. More fundamentally, when these assumptions are violated, the TRC becomes underidentified: multiple, substantively different data-generating processes can yield the same coefficient, undermining its interpretability. These findings call into question the TRC's suitability for applied settings, especially when traits fluctuate over time or measurement conditions are uncontrolled. Alternative approaches are briefly discussed.
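The simulation design sketched in the abstract can be illustrated as follows. The function, parameter values, and variable names are our own illustrative assumptions, not the authors' code; the point is only the direction of the two biases the abstract describes.

```python
# Minimal sketch: the TRC is the correlation between two measurement occasions.
# Under CTT assumptions (fixed true scores, independent errors) it equals
# Var(T) / (Var(T) + Var(E)); violating either assumption biases it.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large n, to isolate systematic bias from sampling noise

def trc(stability, error_corr, true_sd=1.0, err_sd=1.0):
    """Simulate two occasions and return their Pearson correlation.

    stability  : correlation of true scores across occasions (1.0 = fixed)
    error_corr : correlation of error scores across occasions (0.0 = independent)
    """
    t1 = rng.normal(0, true_sd, n)
    t2 = stability * t1 + np.sqrt(1 - stability**2) * rng.normal(0, true_sd, n)
    e1 = rng.normal(0, err_sd, n)
    e2 = error_corr * e1 + np.sqrt(1 - error_corr**2) * rng.normal(0, err_sd, n)
    return np.corrcoef(t1 + e1, t2 + e2)[0, 1]

baseline = trc(stability=1.0, error_corr=0.0)   # assumptions hold: ~0.50 here
unstable = trc(stability=0.7, error_corr=0.0)   # drifting true scores: deflated
dependent = trc(stability=1.0, error_corr=0.6)  # correlated errors: inflated
```

With equal true and error variances the baseline coefficient is 0.5; lowering stability to 0.7 pulls the estimate down, while correlated errors push it up, which also illustrates the underidentification point: very different generating processes can produce the same coefficient.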
- Research Article
- 10.56989/tecdwf79
- Feb 15, 2025
- Contemporary Studies Journal in Education and Psychology
This paper aimed to shed light on Classical Test Theory (CTT), one of the simplest and oldest theories in psychological and educational measurement. It traces CTT's development from the early 20th century by pioneers such as Charles Spearman, Galton, Lord, and Toffler. It also aimed to clarify the basic concepts of the theory, such as true score, error score, reliability, and validity, highlighting their role in test development. The paper also focused on practical applications of CTT in measuring human traits and abilities, emphasizing its influence on test score interpretation. The article's main conclusions are as follows: CTT distinguishes between the observed score (X), true score (T), and error score (E). The true score reflects actual ability, while the error score represents random influences. Reliability is calculated as the ratio of true score variance to total variance. Despite the appearance of newer theories such as Item Response Theory (IRT), CTT remains widely used due to its simplicity and ease of application.
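The decomposition summarized above (X = T + E, with reliability the ratio of true score variance to total variance) can be checked numerically. The numbers below are illustrative assumptions, not values from the paper:

```python
# CTT decomposition: observed = true + error, reliability = Var(T) / Var(X).
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(50, 10, 50_000)   # T: stable ability, sd = 10
error_scores = rng.normal(0, 5, 50_000)    # E: random, mean zero, independent of T
observed = true_scores + error_scores      # X = T + E

reliability = true_scores.var() / observed.var()
# theoretical value: 10**2 / (10**2 + 5**2) = 0.8
```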
- Front Matter
- 10.1016/s1551-7144(09)00212-2
- Jan 1, 2010
- Contemporary Clinical Trials
Classical and modern measurement theories, patient reports, and clinical outcomes
- Book Chapter
- 10.1016/b978-0-08-097086-8.44006-7
- Jan 1, 2015
Classical (Psychometric) Test Theory
- Book Chapter
- 10.4324/9780367815318-15
- May 18, 2021
Rasch measurement theory is a framework for measurement that is defined by essential scientific principles based on the concepts of specific objectivity and invariance applied to models of measurement. One of the oldest traditions in measurement theory is based on the simple sum score. Classical test theory is illustrative of this tradition. Classical test theory defines an observed score as being composed of two components: a true score and error score. With a few simple assumptions, classical test theory can be used to obtain several useful indices of the psychometric quality of a set of scores related to the consistency, reliability, and precision of test scores. The focus of measurement models in the scaling tradition reflects the development of probabilistic models for individual person responses to each item included in a measurement instrument. The models are non-linear, and their major benefit is that they facilitate the development of an invariant scale to represent a latent variable or construct.
- Research Article
- 10.1007/bf02297848
- Sep 1, 1971
- Psychometrika
A general one-way analysis of variance components with unequal replication numbers is used to provide unbiased estimates of the true and error score variance of classical test theory. The inadequacy of the ANOVA theory is noted and the foundations for a Bayesian approach are detailed. The choice of prior distribution is discussed and a justification for the Tiao-Tan prior is found in the particular context of the “n-split” technique. The posterior distributions of reliability, error score variance, observed score variance and true score variance are presented with some extensions of the original work of Tiao and Tan. Special attention is given to simple approximations that are available in important cases and also to the problems that arise when the ANOVA estimate of true score variance is negative. Bayesian methods derived by Box and Tiao and by Lindley are studied numerically in relation to the problem of estimating true score. Each is found to be useful and the advantages and disadvantages of each are discussed and related to the classical test-theoretic methods. Finally, some general relationships between Bayesian inference and classical test theory are discussed.
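The ANOVA route to the CTT variance components mentioned in this abstract can be sketched for the balanced case (the paper treats unequal replication numbers and the Bayesian extension, neither of which is reproduced here; persons play the role of "groups" and replicate scores the role of observations):

```python
# One-way ANOVA estimates of CTT variance components (balanced "n-split" case).
import numpy as np

rng = np.random.default_rng(2)
persons, reps = 2000, 4
true = rng.normal(0, 2.0, persons)[:, None]          # true-score sd = 2
scores = true + rng.normal(0, 1.0, (persons, reps))  # error sd = 1

grand = scores.mean()
person_means = scores.mean(axis=1)
ms_between = reps * ((person_means - grand) ** 2).sum() / (persons - 1)
ms_within = ((scores - person_means[:, None]) ** 2).sum() / (persons * (reps - 1))

error_var = ms_within                       # unbiased for Var(E): ~1.0
true_var = (ms_between - ms_within) / reps  # unbiased for Var(T): ~4.0
reliability = true_var / (true_var + error_var)  # ~0.8 for a single score
```

Note that `(ms_between - ms_within) / reps` can come out negative in small samples, which is exactly the pathology the abstract mentions as motivation for the Bayesian treatment.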
- Research Article
- 10.1080/00220973.1966.11010974
- Sep 1, 1966
- The Journal of Experimental Education
where r₀₀ is reliability, sₑ² is error variance, and sₒ² is observed variance. In previous papers (1, 2) the case of non-independence of true scores and error scores was considered. An example of non-independence of the two components is found in multiple-choice tests, where chance success due to guessing is possible. Here, true scores and error scores are negatively correlated. A person with low true score is presented with a large number of items on which successful guesses are possible. On the other hand, a person with a high true score guesses on only a small number of items. The error introduced by guessing thus depends upon true score. While the error component corresponding to a particular true score is variable, according to a binomial distribution, the mean of that distribution also varies with true score. The result is a negative correlation between true scores and error scores for any given set of true scores.
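The guessing mechanism described here is straightforward to simulate. The item count, number of choices, and range of true scores below are arbitrary assumptions, not values from the paper:

```python
# Each person knows `true` of the n_items answers and guesses the rest on
# `choices`-option items, so the guessing error is binomial with a mean that
# shrinks as the true score grows, inducing a negative corr(T, E).
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items, choices = 20_000, 40, 4

true = rng.integers(5, 36, n_persons)              # items actually known
lucky = rng.binomial(n_items - true, 1 / choices)  # binomial chance successes
error = lucky                                      # guessing component of the score

r_te = np.corrcoef(true, error)[0, 1]  # negative: high true score, fewer guesses
```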
- Book Chapter
- 10.1007/978-981-13-7496-8_3
- Jan 1, 2019
Classical test theory (CTT) rests on the assumption of a normal distribution of scores in some population and assumes scores are not at the extremes of the possible range. In CTT, a person’s observed test score is a sum of a true score and an error score. The test’s reliability is the central index of CTT and is the ratio of true score variance to observed score variance. A person’s true score, with a confidence interval, is estimated from the observed test score using the reliability index. Although not formalized in CTT, two descriptive indices used in CTT are the facility and the discrimination of an item. The former is the percentage of persons who answer an item correctly, and the latter is the correlation between the scores on the item and the scores on the test. The latter values are expected to be similar across items.
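The two descriptive indices named in this chapter, facility and discrimination, can be computed directly from an item-response matrix. The matrix below is made-up toy data:

```python
# Facility = per cent of persons answering the item correctly.
# Discrimination = correlation between item scores and total test scores.
import numpy as np

# rows = persons, columns = items; 1 = correct, 0 = incorrect (toy data)
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])
total = responses.sum(axis=1)

facility = responses.mean(axis=0) * 100  # % correct per item
discrimination = np.array(
    [np.corrcoef(responses[:, j], total)[0, 1] for j in range(responses.shape[1])]
)
```

In practice the item is often removed from the total before correlating (the "corrected" item-total correlation) to avoid inflating the index on short tests; the chapter's plain definition is shown here.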
- Research Article
- 10.2466/pr0.1965.17.1.159
- Aug 1, 1965
- Psychological Reports
The effect of chance success due to guessing upon the variance of multiple-choice test scores was estimated from prepared distributions of large numbers of scores. Each score consisted of an assumed “true score” component and an “error score” component generated by a computer. A large negative correlation was found between true scores and error scores, and a positive correlation between error scores on parallel forms. The equation showing reliability in terms of components of variance was derived under the more general assumption that there is a correlation between true scores and error scores, and the result [Formula: see text] was obtained. The fact that reliability can be positive even though error variance and observed variance are equal was discussed.
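The paper's reliability formula is elided above ("[Formula: see text]"), so it is not reproduced here; but the variance decomposition behind the closing claim can be checked numerically. With correlated components, Var(X) = Var(T) + Var(E) + 2·Cov(T, E), so a sufficiently negative covariance lets the error variance equal the observed variance while the true-score variance stays positive (parameter values below are our own):

```python
# Numeric check: Var(T) = 1, Var(E) = 1, Cov(T, E) = -0.5 gives Var(X) = 1,
# i.e. error variance equals observed variance although Var(T) > 0.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
t = rng.normal(0, 1.0, n)                             # Var(T) = 1
e = -0.5 * t + np.sqrt(0.75) * rng.normal(0, 1.0, n)  # Var(E) = 1, Cov(T,E) = -0.5
x = t + e

var_x = x.var()
decomposed = t.var() + e.var() + 2 * np.cov(t, e, bias=True)[0, 1]
```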
- Book Chapter
- 10.4324/9781138609877-ree26-1
- May 30, 2022
A team of fourth-grade teachers are preparing lessons for the upcoming week and are using an assessment of grade-level mathematics readiness to select appropriate supplemental mathematics materials for groups of students. The teachers have in front of them item-level responses for each student, which they need to aggregate in order to form a total test score from which they can draw inferences. This process of summing up items answered correctly to form a total test score (number or per cent correct) is at the very foundation of classical test theory. Within this framework, the total test score is referred to as an observed score. This shows the level of performance that we see, but not the performance that the test taker might ultimately be capable of. Classical test theory helps test developers and test users, like the fourth-grade team of teachers described here, to understand the discrepancy between observed score and the measure of a person’s true capability, referred to here as a true score. This theory also provides ways of evaluating assessments so that test developers and users alike can rely on these tools to inform practice. Classical test theory (CTT), also referred to as classical true score theory, like other test theories is a ‘symbolic representation of factors influencing observed test scores’ (Allen and Yen 1979, 56). This simple model is governed by a set of assumptions and their resulting conclusions which describe how errors of measurement affect observed scores on measurement instruments. Classical test theory relies mainly on the assumption that the observed score (X) is a function of the sum of random error (E) and true score (T) (ibid., 57; Crocker and Algina 1986, 107; Hambleton and Jones 1993, 40). Test development and evaluation is often based on the standard procedures of CTT (Allen and Yen 1979, 56).
Psychometricians and test developers commonly utilise CTT to compute the reliability of test scores, evaluate the validity of test scores, perform item analysis, estimate variance components to evaluate sources of error, and equate test scores for various purposes. These purposes are discussed in greater detail in the following sections. This entry begins with a brief history of classical test theory, followed by an explanation of the classical true score model. Reliability and the concept of item analysis are then examined, followed by a brief acknowledgement of the limitations of CTT and a short overview of its counterpart, item response theory.
- Book Chapter
- 10.1007/978-94-017-1988-9_2
- Jan 1, 1985
Any mathematical model includes a set of assumptions about the data to which the model applies, and specifies the relationships among observable and unobservable constructs described in the model. Consider as an example the well-known classical test model. With the classical test model, two unobservable constructs are introduced: true score and error score. The true score for an examinee can be defined as his or her expected test score over repeated administrations of the test (or parallel forms). An error score can be defined as the difference between true score and observed score. The classical test model also postulates that (1) error scores are random with a mean of zero and uncorrelated with error scores on a parallel test and with true scores, and (2) true scores, observed scores, and error scores are linearly related.
- Research Article
- 10.52589/bjeldp-iqrlmlzz
- Jul 1, 2024
- British Journal of Education, Learning and Development Psychology
Over the years, measurement experts have been captivated by the description of students, which has resulted in the development of test theories such as Item Response Theory and Classical Test Theory. The traditional method of item analysis, known as "Classical Test Theory," asserts that an individual's observed score on an exam is the sum of their true score and an error score, with all items in the test contributing equally to student performance. Assessment, in this context, refers to any method used to gauge a learner's current knowledge. The significance of Classical Test Theory in teaching, learning, and evaluating learning outcomes has spurred academic inquiry. This paper explored the application of Classical Test Theory in tertiary institution assessment, emphasizing its relevance in evaluating learning outcomes. Some notable points included the simplicity of mathematical procedures in classical test analysis and the straightforwardness of model parameter estimation. Additionally, this paper advocated for the utilization of statistical sophistication inherent in Classical Test Theory to interpret undergraduates' performance effectively. Lecturers were encouraged to familiarize themselves with its application to provide meaningful insights into students' performance.
- Research Article
- 10.1080/00220973.1967.11011019
- Jun 1, 1967
- The Journal of Experimental Education
In previous papers (7, 8) it has been shown that chance success due to guessing introduces an unavoidable source of error into multiple-choice test scores. This particular class of error is negatively correlated with true scores. The usual equations for test reliability and other intercorrelations among components of test scores depend upon the assumption that the correlations between true scores and error scores and between error scores and error scores on parallel forms of a test are zero. In previous papers (6, 8, 9, 10) more general equations for these intercorrelation terms, which do not depend upon the above assumptions, have been presented. Because of the presence of chance success due to guessing, the reliability of a multiple-choice test has a maximum value. In other words, if all sources of error other than chance success due to guessing were eliminated, the reliability of a test would remain at some value less than unity because of the unavoidable error due to guessing. The computer simulation method described previously (8) gave reliabilities for several kinds of tests, under the assumption that only error due to guessing is present. The purpose of this paper is to determine these values using analytic methods. An equation for the maximum reliability of a multiple-choice test, which involves only number of items, number of choices, and mean and variance of true scores (group heterogeneity), is derived. Horst (2) derived equations indicating the maximum correlation between two different tests. Beginning with these, Roberts (5) derived equations for maximum reliability of a test. These results involve item difficulties and are based on assumptions concerning intercorrelations among items. The relation of number of alternative choices to test reliability has also been investigated by Carroll (1), Lord (3), and Plumlee (4).
The present paper differs from these approaches to the problem in that it does not involve item difficulties, but considers only components of variance of test scores. It involves no assumptions about intercorrelations among items and holds for the case in which there is a negative correlation between true scores and error scores introduced by guessing. The result is relatively simple in form.
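The paper's analytic equation for maximum reliability is not given in this abstract, so it is not reproduced here; but its central claim, that guessing alone caps reliability below unity, can be checked by simulating two parallel administrations with identical knowledge and independent guessing (item count, number of choices, and true-score range below are our assumptions):

```python
# Parallel-forms reliability when guessing is the only error source: knowledge
# is perfectly stable across administrations, yet the correlation stays < 1.
import numpy as np

rng = np.random.default_rng(5)
n_persons, n_items, choices = 50_000, 40, 4

true = rng.integers(10, 31, n_persons)  # items known; range sets group heterogeneity

def form_score(true):
    # fresh, independent guesses on each administration of the unknown items
    return true + rng.binomial(n_items - true, 1 / choices)

max_reliability = np.corrcoef(form_score(true), form_score(true))[0, 1]
```

Consistent with the abstract, this ceiling rises with fewer items left to guess, more answer choices, and greater group heterogeneity.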
- Research Article
- 10.1016/j.tics.2021.05.008
- Sep 1, 2021
- Trends in cognitive sciences
Striving toward translation: strategies for reliable fMRI measurement.
- Research Article
- 10.3102/10769986231184147
- Jul 19, 2023
- Journal of Educational and Behavioral Statistics
A general framework of latent trait item response models for continuous responses is given. In contrast to classical test theory (CTT) models, which traditionally distinguish between true scores and error scores, the responses are clearly linked to latent traits. It is shown that CTT models can be derived as special cases, but the model class is much wider. It provides, in particular, appropriate modeling of responses that are restricted in some way, for example, if responses are positive or are restricted to an interval. Restrictions of this sort are easily incorporated in the modeling framework. Restriction to an interval is typically ignored in common models yielding inappropriate models, for example, when modeling Likert-type data. The model also extends common response time models, which can be treated as special cases. The properties of the model class are derived and the role of the total score is investigated, which leads to a modified total score. Several applications illustrate the use of the model including an example, in which covariates that may modify the response are taken into account.
- Research Article
- 10.1016/0022-2496(77)90063-3
- Oct 1, 1977
- Journal of Mathematical Psychology
The theory of test validity and correlated errors of measurement