Abstract
Item response theory (IRT) scoring of health status questionnaires offers many advantages. However, to ensure 'backwards comparability' and to facilitate interpretation of results, we need to be able to express the IRT score in the metrics of the traditional scales. Our objectives were to develop procedures to calibrate IRT-based scores on the Headache Impact Test (HIT) into the metrics of the traditional headache scales, and to assess the degree to which the calibrated HIT scores agree with the observed traditional scores and lead to the same conclusions in group comparisons. We used telephone interview data (n = 1016) and Internet data (n = 1103) from general population surveys of recent headache sufferers. Analyses were conducted in four steps: (1) develop IRT models for all items; (2) for each IRT score level, calculate the expected score on each of the traditional scales (calibration); (3) adjust this calibrated score for measurement error in the IRT score; (4) for each of the traditional scales, assess agreement between calibrated HIT scores and observed scores using the intraclass correlation (ICC), and evaluate the agreement of mean scores and the relative validity (RV) in discriminating among groups differing in migraine diagnosis, headache severity, and change in impact over time. For the traditional categorical questionnaire items (the Migraine Specific Questionnaire (MSQ) and the Headache Disability Inventory (HDI)), the calibrated HIT scores agreed with the observed traditional scores: ICCs were between 0.80 and 0.94. In RV analyses, the maximum mean difference between the observed and expected scores was 1.7 points on a 0-100 scale for comparisons at one point in time. Analyses of change over time, and analyses calibrating scores from the fixed-form HIT-6 to the metric of other questionnaires, were also satisfactory although less precise. Analysis of non-standard questionnaire items (e.g. 'On how many days in the past 3 months did you have a headache?', from the HIMQ and the MIDAS) required special IRT models. Agreement was lower for these items: ICCs were between 0.56 and 0.61, and the maximum mean differences were 2.9 (on a 0-270 scale) and 3.8 (on a 0-450 scale) in RV analyses at one point in time. The ability of the calibrated scale scores to discriminate between groups was at least as good as that of the observed sum scales and often remarkably better. Our results support the theoretical advantage of IRT models in scale calibration. This approach to achieving comparability of new and widely used scales, and to accelerating the accumulation of interpretation guidelines based on previous work, warrants testing for measures of other generic and disease-specific concepts.
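The calibration in step (2) can be illustrated with a minimal sketch. It assumes a generalized partial credit model for the categorical items, which the abstract does not specify, and every item parameter below is hypothetical rather than an estimate from the study. For a given IRT score (theta), the expected score on a traditional scale is the sum of the expected item scores, here rescaled to 0-100. The adjustment for measurement error in step (3) is not shown.

```python
import math


def gpcm_probs(theta, a, thresholds):
    """Category probabilities for one item under a generalized partial
    credit model (an assumed model, chosen for illustration)."""
    # Category 0 has logit 0; category k adds a * (theta - b_k) to the previous one.
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_item_score(theta, a, thresholds):
    """Expected category score (0..m) on one item at a given theta."""
    probs = gpcm_probs(theta, a, thresholds)
    return sum(k * p for k, p in enumerate(probs))


def expected_scale_score(theta, items):
    """Expected raw sum score over all items, rescaled to 0-100."""
    raw = sum(expected_item_score(theta, a, bs) for a, bs in items)
    max_raw = sum(len(bs) for _, bs in items)  # top category of each item
    return 100.0 * raw / max_raw


# Hypothetical (discrimination, thresholds) pairs, not values from the study.
ITEMS = [
    (1.2, [-1.0, 0.0, 1.0]),
    (0.8, [-0.5, 0.5]),
    (1.5, [-1.5, -0.5, 0.5, 1.5]),
]

if __name__ == "__main__":
    # Calibration table: IRT score level -> expected traditional scale score.
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, round(expected_scale_score(theta, ITEMS), 1))
```

Tabulating this function over a grid of theta values gives a lookup from IRT scores to the metric of the traditional scale, which is the sense of 'calibration' used here.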