Statistical and clinical perspectives on risk models can be very different, even when there is agreement on the objective: to develop accurate and precise risk estimates for rational, effective, and cost-effective prevention strategies (1). For example, Katki et al. (2) estimated the risk of cervical precancer (cervical intraepithelial neoplasia grade 3) or worse conditional on the results of concurrent testing for human papillomavirus (HPV) and cervical cytology (cotesting) in 300,000 women 30 years of age or older who were enrolled in a large health maintenance organization. The nearly equivalent risks over 5 years for all women who were HPV negative (3.8 per 100,000 per year) and for those who were both HPV negative and cytology negative (3.2 per 100,000 per year) might justify a prevention program in which primary screening is based on results of HPV testing alone. In contrast, among 17,000 HPV-positive women, the 5-year risk of cervical intraepithelial neoplasia grade 3 or worse was substantially lower in the 12,000 women with normal cytology (5.9%) than in the 5,000 women with abnormal cytology (12.1%). This suggests that cytology testing might improve decisions about the management of women who test positive for HPV.

As exemplified by Katki et al. (2), the value of an additional clinical test is a function of simple risks, or generalizations of predictive values, together with the proportions of patients with the different clinical presentations. Both a risk model and a clinical test are risk classifiers, or risk stratifiers. For the evaluation of risk models, as Cook (3) notes in one commentary discussing an article by Pencina et al. (4) in this issue of the Journal, none of the area under the receiver operating characteristic curve (AUC), the integrated discrimination improvement (IDI), and the net reclassification improvement (NRI) is necessary when direct evaluation of the clinical performance of a program based on the risk model is possible. The measure of clinical utility is the difference in rates or risks of clinical outcomes between intervention strategies whose assignment rules are based on the different models. The differences in risk and the distribution of the possible clinical presentations under the models are measures of risk stratification, or of the extra precision resulting from the added complexity of a model. Of course, clinical utility also depends on the efficacy of available interventions; a model with perfect predictions has no clinical utility without an effective intervention.

Pencina et al. (4) discussed measures for evaluating risk models by purely statistical assessment, without consideration of interventions of known efficacy, which evaluation of clinical utility requires. They advocated reporting the IDI and the NRI in addition to the most commonly used measure, the AUC. The AUC is a nonparametric statistic (the Mann–Whitney U statistic, equivalent to the Wilcoxon rank sum statistic) for testing the equality of the distributions of estimated risk in cases and controls. The IDI adds to the AUC because an estimate of the difference in means adds information that is not available from a statistic based on ranks rather than actual values. NRIs can capture effects of differences in distribution, including spread and skewness, that are not reflected in the difference in means.
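To make these 3 measures concrete, the minimal sketch below (an illustration added here, not code from any of the articles under discussion) computes the AUC as a Mann–Whitney statistic, the IDI as the change in discrimination slope (the difference in mean estimated risk between cases and controls), and a 2-category NRI at a single risk threshold. The simulated risks, the threshold of 0.5, and all variable names are assumptions made for illustration.

```python
# Minimal sketch of AUC, IDI, and a 2-category NRI for a baseline risk
# model and an expanded model. All data and names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Simulated outcomes (1 = case, 0 = control) and estimated risks from a
# baseline model (p_old) and an expanded model (p_new); the expanded
# model separates cases from controls slightly better by construction.
y = rng.integers(0, 2, size=1000)
p_old = np.clip(0.25 * y + rng.normal(0.30, 0.15, 1000), 0.01, 0.99)
p_new = np.clip(0.35 * y + rng.normal(0.27, 0.15, 1000), 0.01, 0.99)

def auc(p, y):
    """AUC as the Mann-Whitney statistic: the proportion of
    case-control pairs in which the case has the higher estimated risk."""
    cases, controls = p[y == 1], p[y == 0]
    diffs = cases[:, None] - controls[None, :]
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size

def idi(p_new, p_old, y):
    """IDI: change in the discrimination slope, i.e., in the difference
    in mean estimated risk between cases and controls."""
    slope = lambda p: p[y == 1].mean() - p[y == 0].mean()
    return slope(p_new) - slope(p_old)

def nri(p_new, p_old, y, threshold=0.5):
    """2-category NRI at one threshold: net upward reclassification of
    cases plus net downward reclassification of controls."""
    up = (p_new > threshold) & (p_old <= threshold)
    down = (p_new <= threshold) & (p_old > threshold)
    cases, controls = y == 1, y == 0
    return (up[cases].mean() - down[cases].mean()) + \
           (down[controls].mean() - up[controls].mean())

print(f"delta AUC = {auc(p_new, y) - auc(p_old, y):+.3f}")
print(f"IDI       = {idi(p_new, p_old, y):+.3f}")
print(f"NRI       = {nri(p_new, p_old, y):+.3f}")
```

Because the AUC uses only the ranks of the estimated risks, the IDI and NRI can change even when the AUC does not, which is the sense in which they add information.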
As noted by Pencina et al. (4), 2 of the American Heart Association's 2009 criteria for evaluation of novel markers of cardiovascular risk are “documentation of incremental information when added to standard risk markers” and “assessment of effects on patient management and outcomes” (5, p. 2408). As noted by Kerr et al. (6), assessment of the marginal increase in utility from an improved model requires an objective assessment of effects on patient management and outcomes; the incremental-information criterion must therefore be a purely statistical measure if it is to be separate from clinical utility.

The difference between the clinical and statistical views is manifested in one particularly important way: the role of variability in risk. Less variability, or more homogeneity, of risk within cases and within controls increases the AUC and other measures of discrimination; in contrast, greater variation in risk across the study population for which decisions are to be made increases the potential for assigning to those at extreme risk an intervention different from the one appropriate for a person at average risk. In other words, small variance in risk conditional on disease status increases discrimination, but large unconditional variance increases the potential for clinical utility (a distinction the simulation sketch below makes concrete). Of course, the variation must be real, not a consequence of random variation in the risk estimates or of misclassification of markers in the model.

This exchange of views (1, 2, 6, 7) highlights the differences between the clinical and statistical perspectives on risk models. The connection between the 2 perspectives is not yet clear to all.
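As a closing illustration of the variance point (again a hypothetical sketch, not an analysis from the exchange), the simulation below applies a monotone compression of estimated risks toward the population mean. Because the AUC depends only on ranks, it is unchanged; but the shrunken unconditional variance means that no one crosses a hypothetical high-risk treatment threshold, so the compressed risks stratify patients less usefully. The Beta(2, 8) risk distribution and the threshold of 0.4 are arbitrary choices for illustration.

```python
# Sketch: a monotone compression of risks preserves the AUC (rank-based)
# but reduces unconditional variance and hence risk stratification.
import numpy as np

rng = np.random.default_rng(1)
risk = rng.beta(2, 8, size=10_000)   # heterogeneous true risks, mean 0.2
y = rng.binomial(1, risk)            # outcomes generated from those risks

# Shrink risks toward the mean; the ordering of patients is preserved.
compressed = 0.2 + 0.2 * (risk - 0.2)

def auc(p, y):
    cases, controls = p[y == 1], p[y == 0]
    return (cases[:, None] > controls[None, :]).mean()

threshold = 0.4                      # hypothetical treatment threshold
print("AUC (original):  ", round(auc(risk, y), 3))
print("AUC (compressed):", round(auc(compressed, y), 3))
print("Fraction above threshold (original):  ", (risk > threshold).mean())
print("Fraction above threshold (compressed):", (compressed > threshold).mean())
```

Both risk vectors rank patients identically, yet only the more dispersed one identifies any patients whose risk is high enough to warrant an intervention different from the one given to a person at average risk.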