Taylor and McPherson report using a form of item-response theory (IRT) called Rasch analysis to evaluate the Health Assessment Questionnaire (HAQ) disability index and the Short Form 36 (SF-36) physical function score (1). Methods similar to Rasch analysis were used to develop scores for the Scholastic Aptitude Test (SAT), an evaluation of academic skills used in college admissions evaluations that is familiar to US readers. One of the underlying theorems of IRT is that if you can answer a difficult question, you can also answer (probabilistically) all easier questions, and, therefore, questions can be ranked by difficulty. One might rank mathematical ability as ordered by addition, subtraction, multiplication, division, algebra, calculus, etc. In addition, Rasch analysis can assign a difficulty score to each item (question). With such information, the level of academic skill can be measured. Assessments such as the SAT have to be reliable. Reliability means that, for example, if you administer the SAT twice, perhaps a month apart, you will get very similar results. The variability between the 2 administration results is a measure of test reliability, and can be expressed as a correlation coefficient. The reliability of SAT tests is approximately 0.90 (2).

It is useful to think of disability assessment as being similar to academic skills assessment. Using Rasch analysis, one can assign difficulty scores to individual disability assessment questions (3,4). For example, it is easier to lift a cup to your mouth or walk on flat ground than it is to do outside work or walk 2 miles. If the disability scale is constructed correctly, disability can be measured on a linear continuous scale, just as one would use a thermometer to measure body temperature.

Practically, however, questionnaires (and we extend this discussion to visual analog scales for pain, etc.) deviate significantly from the ideal.
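The "if you can answer a difficult question, you can probabilistically answer all easier ones" idea can be made concrete with the standard dichotomous Rasch model, in which the probability of endorsing an item depends only on the gap between the person's ability and the item's difficulty. The sketch below is a generic illustration of that model, not the authors' analysis; the ability and difficulty values are arbitrary.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability that a person at the given ability level endorses
    (answers correctly) an item of the given difficulty, under the
    one-parameter (Rasch) logistic model:
        P = exp(ability - difficulty) / (1 + exp(ability - difficulty))
    Both parameters are on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the endorsement probability is 50%.
p_matched = rasch_probability(0.0, 0.0)   # 0.5

# Easier items (lower difficulty) are always more likely to be
# endorsed, which is what lets items be ranked by difficulty.
p_easy = rasch_probability(0.0, -2.0)     # high
p_hard = rasch_probability(0.0, 2.0)      # low
```

Because every item shares the same slope in this model, the ordering of items by difficulty is the same for every respondent, which is the property that supports a single linear scale.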
Although the SAT actually tests academic ability, clinical questionnaires test perception of functional ability or pain. Given the same painful stimulus, patients will differ in their assessments of the severity of pain. Similarly, people with the same level of physical ability will differ in their assessment of their physical ability.

Therefore, the first problem with clinical assessments is that there is no absolute standard. However, we can assume that each person has his or her own innate, internal standard, and that standard is usually not too distant from the mean of all patients. Clinicians know that some patients are high reporters of pain and disability while others are low reporters, and they make mental adjustments based on such facts. This wavering standard, while of considerable importance, defines interpretability, but does not have anything to do with reliability. Interpretability is a keystone of clinical evaluations, but not of clinical trials or blind evaluation of scores.

The second problem with questionnaire assessments is that they most often have poor reliability. This applies not only to HAQ scores (r = 0.85) or pain scores (r = 0.7–0.8), but also to physician's global assessments, joint tenderness and swelling counts (r = 0.8), and to the Disease Activity Score (DAS) (r = 0.8) (5–7). From reliability we can estimate the minimal detectable change (MDC), also called the reliable change or the smallest real difference (8). For the technically minded, MDC = SEM × 1.96 × √2, where SEM = SD × √(1 − reliability). The consequence of this is that, given 2 measurements in an individual patient, we can only say with confidence that differences between 2 assessments that equal or exceed 0.75 for the HAQ, 2.0 for the DAS, and 3.5 for VAS pain are statistically significant.
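The MDC formula is easy to check numerically. In the sketch below, the standard deviation of 0.7 HAQ units is an assumed, illustrative population SD (it is not given in this article); combined with the quoted reliability of 0.85 it reproduces a threshold close to the 0.75 HAQ difference mentioned above.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def mdc95(sd, reliability):
    """Minimal detectable change at 95% confidence:
    MDC = SEM * 1.96 * sqrt(2).
    The sqrt(2) appears because measurement error attaches to each
    of the 2 assessments, not to the change score itself."""
    return sem(sd, reliability) * 1.96 * math.sqrt(2)

# Illustration with an ASSUMED population SD of 0.7 HAQ units
# and the quoted test-retest reliability of 0.85:
threshold = mdc95(sd=0.7, reliability=0.85)  # roughly 0.75
```

Note how sensitive the threshold is to reliability: with the same SD, raising reliability from 0.85 to 0.95 roughly halves the MDC, which is why low questionnaire reliability translates into large required change scores.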
The reason that the necessary differences are so great is that the uncertainty (reliability estimate) is applied not to the change score but to each of the 2 test measurements.

There are 3 settings in which reliability is important: clinical trials in groups of patients, blind assessments in individual patients, and informed assessments in clinic patients. In clinical trials, low reliability can be overcome by increasing sample size. Blind assessments occur when one tries to interpret 2 scores (i.e., before and after treatment) without the use of additional information. Such assessments are made by insurance companies, third-party providers, and regulatory authorities, among others, to see