Incomplete Validation of Risk Stratification Indices

Nathan L Pace

doi:10.1097/aln.0b013e31821f6585

Abstract

University of Utah, Salt Lake City, Utah. n.l.pace@utah.edu

Using 2001–2006 Medicare hospital data (Medicare Provider Analysis and Review [MEDPAR]) for approximately 17 million patients aged 65 yr and older, Sessler and colleagues¹ have proposed four risk stratification indices (RSIs) for mortality and duration of hospital stay. With a complex, stepwise hierarchical-selection algorithm, the authors¹ chose a parsimonious set of statistically significant predictors from the approximately 20,500 International Classification of Diseases, Ninth Revision, diagnostic and procedure codes. For example, in-hospital mortality was modeled on 184 predictor codes with odds ratios varying from 0.131 to 57.821. Using a split sample design, these RSIs were internally validated on MEDPAR data for another 17 million patients and were externally validated on 100,000 patient records from the Cleveland Clinic (Ohio; Perioperative Health Documentation System). Working in the parameter space (β coefficients), validation of the RSIs was demonstrated on the development, validation, and external datasets by the c (concordance) statistic,² which revealed very good discrimination in all datasets.

However, the performance of these RSIs has not been adequately justified. To do so requires calculation of the prediction probability for each patient by exponentiation of the RSI (inverse logit; $P_i = 1/[1 + e^{-\mathrm{RSI}}]$); $P_i$ ranges from 0 to 1 (open interval). For each patient, the prediction probability $P_i$ is compared with the observed dichotomous outcome $Y_i = 0$ (alive) or $Y_i = 1$ (dead). Overall performance of an RSI is measured by the distance of the predicted outcome ($P_i$) from the actual outcome ($Y_i$); a good model of risk will have a short average distance. The accepted measures for overall performance in the validation datasets are the Brier score and the Nagelkerke R² statistic.² Overall performance can be partitioned into two characteristics: discrimination and calibration.
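The overall performance calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the analysis of Sessler and colleagues: the RSI values and outcomes are hypothetical, the function names are invented for this example, the outcome is coded 1 for in-hospital death, and the Nagelkerke R² is computed under the assumption that the null model is the observed death rate in the same dataset.

```python
import math

def inverse_logit(rsi):
    """Convert an RSI value (log-odds scale) into a predicted probability."""
    return 1.0 / (1.0 + math.exp(-rsi))

def brier_score(probs, outcomes):
    """Mean squared distance between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_likelihood(probs, outcomes):
    """Bernoulli log-likelihood of the outcomes given the predictions."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for p, y in zip(probs, outcomes))

def nagelkerke_r2(probs, outcomes):
    """Nagelkerke R^2 of fixed predictions against a null (event-rate) model."""
    n = len(outcomes)
    rate = sum(outcomes) / n                      # assumed null model
    ll_null = log_likelihood([rate] * n, outcomes)
    ll_model = log_likelihood(probs, outcomes)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell

# Hypothetical RSI values and outcomes (1 = died in hospital, 0 = survived).
rsi_values = [-3.0, -1.0, 0.0, 2.0]
died = [0, 0, 1, 1]

probs = [inverse_logit(r) for r in rsi_values]
print("Brier score:", brier_score(probs, died))
print("Nagelkerke R2:", nagelkerke_r2(probs, died))
```

A Brier score of 0 corresponds to perfect predictions, and a noninformative prediction of 0.5 for every patient gives 0.25, so lower values indicate a shorter average distance between $P_i$ and $Y_i$.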
Statistical software tools for estimation of overall performance, discrimination, and calibration are readily available.

The c statistic is a measure of discrimination; it is a rank order statistic for predictions versus actual outcomes and is equivalent to the area under the receiver operating characteristic curve. As rank order statistics are invariant under monotonic transformations, the c statistic of RSI is identical to the c statistic of $P_i$. Perfect discrimination corresponds with a c statistic of 1 and is achieved if the $P_i$ or RSI scores for all patients dying are higher than those for all patients not dying, with no overlap. A c statistic value of 0.5 indicates an RSI without discrimination (i.e., no better than flipping a coin). While a good risk model will have high discrimination, by itself the c statistic is not optimal in assessing or comparing risk models.³

The third aspect of performance is calibration (i.e., the agreement between observed outcomes and predictions). For example, if an RSI score has a predicted probability of 20% for in-hospital mortality, then approximately 20% of inpatients with that RSI score are expected to die. The calibration of prediction probability can be assessed by regression plots of $Y_i$ versus $P_i$, with patients grouped by deciles; there is also a specialized binary regression method.⁴

Sessler et al.¹ should be congratulated for their statistical models of risk that may, in the future, allow comparisons of outcomes of health care across institutions. I hope that they will provide supplementary analyses to demonstrate that, besides good discrimination, their RSIs also have good calibration and overall performance.
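The distinction between discrimination and calibration drawn above can be illustrated with a short Python sketch. It is an assumption-laden example rather than a reanalysis: the data are hypothetical, the function names are invented here, the outcome is coded 1 for in-hospital death, and only three groups are used instead of deciles because the toy dataset has six patients.

```python
import math

def c_statistic(scores, outcomes):
    """Fraction of (death, survivor) pairs ranked correctly; ties count 1/2."""
    deaths = [s for s, y in zip(scores, outcomes) if y == 1]
    survivors = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))

def calibration_table(probs, outcomes, n_groups=10):
    """Mean predicted vs observed death rate in groups ordered by prediction."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if chunk:  # skip empty groups when n < n_groups
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            obs_rate = sum(y for _, y in chunk) / len(chunk)
            rows.append((mean_pred, obs_rate))
    return rows

# Hypothetical RSI values and outcomes (1 = died in hospital, 0 = survived).
rsi = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
died = [0, 0, 1, 0, 1, 1]
probs = [1.0 / (1.0 + math.exp(-r)) for r in rsi]

# The inverse logit is monotonic, so both c statistics agree.
print(c_statistic(rsi, died), c_statistic(probs, died))

for mean_pred, obs_rate in calibration_table(probs, died, n_groups=3):
    print(f"predicted {mean_pred:.3f}  observed {obs_rate:.3f}")
```

The c statistic depends only on the ordering of the scores, which is why it cannot reveal whether the predicted probabilities themselves are too high or too low; the grouped comparison of predicted and observed rates addresses that separate question.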
