Abstract

Importance: Unreliable performance measures can mask poor-quality care and distort financial incentives in value-based purchasing.

Objective: To examine the association between test-retest reliability and the reproducibility of hospital rankings.

Design, Setting, and Participants: In this cross-sectional study, Centers for Medicare & Medicaid Services Hospital Compare data were analyzed for the 2017 (based on 2014-2017 data) and 2018 (based on 2015-2018 data) reporting periods. The study was conducted from December 13, 2020, to September 30, 2021. The analysis was based on 28 measures, including mortality (acute myocardial infarction, congestive heart failure, pneumonia, and coronary artery bypass grafting), readmissions (acute myocardial infarction, congestive heart failure, pneumonia, and coronary artery bypass grafting), and surgical complications (postoperative acute kidney failure, postoperative respiratory failure, postoperative sepsis, and failure to rescue).

Exposures: Measure reliability, assessed with test-retest reliability testing.

Main Outcomes and Measures: The reproducibility of hospital rankings was quantified by calculating the reclassification rate across the 2017 and 2018 reporting periods after categorizing the hospitals into terciles, quartiles, deciles, and statistical outliers. Linear regression analysis was used to examine the association between the reclassification rate and the intraclass correlation coefficient for each of the classification systems.

Results: The analytic cohort consisted of 28 measures from 4452 hospitals, with a median of 2927 (IQR, 2378-3160) hospitals contributing data for each measure. Hospitals participating in the Inpatient Prospective Payment System (n = 3195) had a median bed size of 141 (IQR, 69-261), a median average daily census of 70 (IQR, 24-155) patients, and a median disproportionate share hospital percentage of 38.2% (IQR, 18.7%-36.6%). The median intraclass correlation coefficient was 0.78 (IQR, 0.72-0.81), ranging from 0.50 to 0.85. The median reclassification rate was 70% (IQR, 62%-71%) when hospitals were ranked by deciles, 43% (IQR, 39%-45%) when ranked by quartiles, 34% (IQR, 31%-36%) when ranked by terciles, and 3.8% (IQR, 2.0%-6.2%) when ranked by outlier status. Increases in measure reliability were not associated with decreases in the reclassification rate. Each 0.1-point increase in the intraclass correlation coefficient was associated with an increase in the reclassification rate of 6.80 percentage points (95% CI, 2.28-11.30; P = .005) when hospitals were ranked into performance deciles, 4.15 percentage points (95% CI, 1.16-7.14; P = .008) when ranked into performance quartiles, 1.47 percentage points (95% CI, −1.84 to 4.77; P = .37) when ranked into performance terciles, and 3.70 percentage points (95% CI, 1.30-6.09; P = .004) when ranked by outlier status.

Conclusions and Relevance: In this study, measures with higher test-retest reliability were not associated with lower rates of hospital reclassification. These findings suggest that measure reliability should not be assessed with test-retest reliability testing.
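
To make the ranking comparison concrete, the following is a minimal sketch, not the authors' code, of how a reclassification rate can be computed from hospital scores in two reporting periods. The hospital IDs and scores are hypothetical, and pandas quantile binning is used for the tercile, quartile, or decile cut points.

    import pandas as pd

    def reclassification_rate(scores_t1, scores_t2, n_bins):
        """Fraction of hospitals whose performance bin (tercile, quartile,
        or decile) changes between two reporting periods. Each input is a
        pandas Series of measure scores indexed by hospital ID."""
        common = scores_t1.index.intersection(scores_t2.index)
        bins_t1 = pd.qcut(scores_t1.loc[common], q=n_bins, labels=False)
        bins_t2 = pd.qcut(scores_t2.loc[common], q=n_bins, labels=False)
        return float((bins_t1 != bins_t2).mean())

    # Hypothetical risk-standardized rates for five hospitals.
    s2017 = pd.Series({"A": 0.12, "B": 0.15, "C": 0.11, "D": 0.18, "E": 0.14})
    s2018 = pd.Series({"A": 0.13, "B": 0.12, "C": 0.16, "D": 0.17, "E": 0.145})
    print(reclassification_rate(s2017, s2018, n_bins=4))  # quartile reclassification

Classification by outlier status, which produced much lower reclassification rates in the study, would instead compare each hospital's interval estimate against the overall rate rather than assigning quantile bins.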

Highlights

  • The Affordable Care Act and the Medicare Access and Children's Health Insurance Program Reauthorization Act were intended to expand health insurance coverage, improve health care quality, and control the growth of health care spending

  • Each 0.1-point increase in the intraclass correlation coefficient was associated with a 6.80 percentage-point increase in the reclassification rate when hospitals were ranked into performance deciles, 4.15 when ranked into performance quartiles, 1.47 when ranked into performance terciles, and 3.70 when ranked by outlier status

  • In this study, measures with higher test-retest reliability were not associated with lower rates of hospital reclassification. These findings suggest that measure reliability should not be assessed with test-retest reliability testing

Introduction

The Affordable Care Act and the Medicare Access and Children's Health Insurance Program Reauthorization Act were intended to expand health insurance coverage, improve health care quality, and control the growth of health care spending. The National Quality Forum (NQF) measure evaluation algorithm does not currently prescribe a numeric threshold for acceptable reliability.[6] In practice, the NQF Scientific Methods Panel,[7] which is charged with evaluating the reliability and validity of complex measures, has used 0.7 as the threshold for acceptable reliability[8,9,10,11] and has considered 0.5 to 0.69 borderline acceptable. These thresholds are similar to those of the Landis scale, which specifies arbitrary thresholds for quantifying observer agreement for categorical data.[12] The Landis scale was not, however, created to evaluate measure reliability.
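
For illustration, here is one standard way such a reliability statistic can be computed: a one-way random-effects intraclass correlation coefficient (Shrout and Fleiss ICC(1,1)) estimated from each hospital's score in two reporting periods. This is a sketch of the general technique against which thresholds like 0.7 are applied, not necessarily the specific model used in the study or by the panel; the example data are hypothetical.

    import numpy as np

    def icc_one_way(scores):
        """One-way random-effects ICC(1,1). `scores` has shape
        (n_hospitals, k), one column per reporting period."""
        n, k = scores.shape
        grand = scores.mean()
        row_means = scores.mean(axis=1)
        # Mean squares between and within hospitals.
        ms_between = k * ((row_means - grand) ** 2).sum() / (n - 1)
        ms_within = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Hypothetical scores for four hospitals across two reporting periods.
    x = np.array([[0.12, 0.13], [0.15, 0.12], [0.11, 0.16], [0.18, 0.17]])
    print(icc_one_way(x))  # compare against the 0.7 acceptability threshold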
