The National Institutes of Health (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS®) Roadmap initiative (www.nihpromis.org) is a cooperative research program designed to develop, evaluate, and standardize item banks to measure patient-reported outcomes (PROs) across different medical conditions as well as in the US general population (1). The goal of PROMIS is to develop reliable and valid item banks, using item response theory (IRT), that can be administered in a variety of formats, including short forms and computerized adaptive tests (CAT) (1-3).

IRT is often referred to as "modern psychometric theory," in contrast to "classical test theory" (CTT). The basic idea behind both IRT and CTT is that some latent construct, or "trait," underlies an illness experience. This construct cannot be measured directly, but it can be measured indirectly by creating items that are scaled and scored. For example, "fatigue," "pain," "disability," and even "happiness" are latent constructs, i.e., subjective experiences: we cannot photograph them, view them on an X-ray, or detect them with a blood test, yet we know they exist. People can experience more or less of these constructs, so it is helpful to translate that experience into a set of levels represented by scores. IRT models the associations between items and the latent construct; specifically, IRT models describe the relationship between a respondent's underlying level on a construct and the probability of particular item responses.

Tests developed with CTT (such as the Health Assessment Questionnaire-Disability Index (4) or the Scleroderma Gastrointestinal Tract instrument (5)) require administering all items, even though only some are appropriate for a given person's trait level. Some items are too high for those with low trait levels (e.g., "Can you walk 100 yards?" asked of a patient in a wheelchair) and some are too low for those with high trait levels (e.g., "Can you get up from a chair?" asked of a runner). In contrast, IRT methods make it possible to estimate a person's trait level from any subset of items in an item pool that is appropriate for that level. As such, any set of items from the pool can be administered as a fixed form or, for greatest efficiency, as a CAT. CAT is an approach to administering the subset of items in an item bank that is most informative for measuring the health construct, in order to achieve a target standard error of measurement. A good item bank contains items that represent a range of content and difficulty, provide a high level of information, and perform equivalently in different subgroups of the target population.

How does CAT work? Without prior information, the first item administered in a CAT is typically one of medium trait level, for example, "In the past 7 days I was grouchy," with multi-level response options ranging from "never" to "always." After each response, the person's trait level and its associated standard error are estimated. If the person does not endorse the first item, the next item administered is an "easier" one; if the person endorses it, the next item is a "harder" one. The CAT terminates when the standard error falls below an acceptable value. This yields an estimate of the person's score with a minimal number of questions and no loss of measurement precision. In addition, scores from different studies using different items can be compared on a common scale.
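To make the adaptive loop concrete, the following is a minimal Python sketch, not a description of any actual PROMIS implementation. It uses the dichotomous two-parameter logistic (2PL) model for simplicity (PROMIS item banks are calibrated with the polytomous graded response model), and the item bank, all parameter values, the `se_target` stopping value, and every function name are invented for illustration.

```python
import math

# Hypothetical item bank of (discrimination a, difficulty b) parameters.
# All values are invented for illustration, not real calibrations.
ITEM_BANK = [
    {"a": 1.8, "b": -2.0}, {"a": 2.1, "b": -1.0}, {"a": 1.5, "b": -0.5},
    {"a": 2.4, "b": 0.0},  {"a": 1.9, "b": 0.5},  {"a": 2.0, "b": 1.0},
    {"a": 1.7, "b": 2.0},
]

def prob_endorse(theta, item):
    """2PL item response function: P(endorse | theta)."""
    return 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))

def item_information(theta, item):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = prob_endorse(theta, item)
    return item["a"] ** 2 * p * (1.0 - p)

def estimate_theta(responses):
    """Maximum-likelihood theta on a coarse grid (production CATs use
    EAP or Newton-Raphson estimation instead)."""
    grid = [g / 100.0 for g in range(-400, 401)]
    def log_likelihood(theta):
        ll = 0.0
        for item, endorsed in responses:
            p = prob_endorse(theta, item)
            ll += math.log(p if endorsed else 1.0 - p)
        return ll
    return max(grid, key=log_likelihood)

def standard_error(theta, items):
    """SE(theta) = 1 / sqrt(sum of item information at theta)."""
    return 1.0 / math.sqrt(sum(item_information(theta, i) for i in items))

def run_cat(answer, se_target=0.4):
    """Administer the most informative remaining item, re-estimate theta,
    and stop once the standard error falls below se_target."""
    theta, responses, remaining = 0.0, [], list(ITEM_BANK)
    while remaining:
        # With theta initialized to 0, the first item selected is a
        # medium-difficulty one, as described in the text above.
        item = max(remaining, key=lambda i: item_information(theta, i))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta = estimate_theta(responses)
        if standard_error(theta, [i for i, _ in responses]) < se_target:
            break
    return theta, standard_error(theta, [i for i, _ in responses])
```

For instance, with `import random`, calling `run_cat(lambda item: random.random() < prob_endorse(1.2, item))` simulates a respondent whose true trait level is 1.2; the loop stops as soon as the target precision is reached or the bank is exhausted. Because the most informative item is always the one whose difficulty lies near the current theta estimate, the "easier item after a non-endorsement, harder item after an endorsement" behavior described above falls out of the selection rule automatically.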
IRT models estimate the underlying scale score (theta) from the item responses. All items are calibrated on the same metric and, both independently and collectively, provide an estimate of theta. Hence it is possible to estimate the score using any subset of items, together with the standard error of that estimate. This allows health outcomes to be assessed across patients with differing medical conditions (for example, comparing the score of someone with arthritis to that of someone with heart disease) and across degrees of physical and other impairment, at both the lowest and highest ends of the trait continuum.
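To ground this subset property, here is the same idea written out for the dichotomous 2PL case used in the sketch above (PROMIS banks use the polytomous graded response model, but the logic is identical). For any administered subset $S$ of calibrated items with responses $u_i \in \{0, 1\}$ and item response functions $P_i(\theta) = 1 / (1 + e^{-a_i(\theta - b_i)})$:

```latex
L(\theta) = \prod_{i \in S} P_i(\theta)^{u_i}\,\bigl(1 - P_i(\theta)\bigr)^{1-u_i},
\qquad
\hat{\theta} = \arg\max_{\theta} L(\theta),
\qquad
\mathrm{SE}(\hat{\theta}) \approx
\Bigl[\sum_{i \in S} a_i^{2}\, P_i(\hat{\theta})\,\bigl(1 - P_i(\hat{\theta})\bigr)\Bigr]^{-1/2}
```

Because the item parameters $a_i$ and $b_i$ are fixed by a one-time calibration on a common metric, the resulting theta estimate is on the same scale no matter which subset $S$ was administered, which is what makes cross-study and cross-condition comparisons possible.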