Abstract

Background

The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC "has vastly inferior statistical properties," i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests.

Discussion

We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test has the nominal type I error even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I errors when the sample size is small, while the type I error converges to the nominal value asymptotically with increasing sample size as expected. We further show that the DeLong et al. method tests a different hypothesis and has the nominal type I error when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper.

Summary

We show that there is nothing inherently less powerful or disagreeable about ROC analysis for showing the usefulness of new biomarkers or characterizing the performance of classification models.
Each statistical method for assessing biomarkers and classification models has its own strengths and weaknesses. Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies.
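The small-sample inflation of the LR test's type I error described above can be sketched with a Monte Carlo simulation along the following lines. This is a minimal illustration, not the authors' simulation code: the Newton-Raphson fitter, the sample size, the effect size of 0.5 for the existing biomarker, and all function names are our assumptions.

```python
import numpy as np

def fit_logistic(X, y, iters=25, ridge=1e-6):
    """Logistic-regression MLE via Newton's method; returns the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30, 30)            # guard against overflow under separation
        p = 1.0 / (1.0 + np.exp(-eta))
        H = X.T @ (X * (p * (1 - p))[:, None])      # observed information matrix
        beta += np.linalg.solve(H + ridge * np.eye(X.shape[1]), X.T @ (y - p))
    eta = np.clip(X @ beta, -30, 30)
    p = 1.0 / (1.0 + np.exp(-eta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def lr_type1_error(n=30, reps=500, seed=1):
    """Estimate the type I error of the LR test for one extra, truly null predictor."""
    rng = np.random.default_rng(seed)
    crit = 3.841  # chi-square(1) critical value at alpha = 0.05
    rejections = 0
    for _ in range(reps):
        x1 = rng.normal(size=n)                     # existing biomarker
        x2 = rng.normal(size=n)                     # new biomarker, true coefficient = 0
        p_true = 1.0 / (1.0 + np.exp(-0.5 * x1))
        y = (rng.uniform(size=n) < p_true).astype(float)
        ones = np.ones(n)
        ll_full = fit_logistic(np.column_stack([ones, x1, x2]), y)
        ll_reduced = fit_logistic(np.column_stack([ones, x1]), y)
        if 2.0 * (ll_full - ll_reduced) > crit:     # LR statistic vs. critical value
            rejections += 1
    return rejections / reps
```

With a small n, the estimated rejection rate under the null typically exceeds the nominal 0.05, consistent with the asymptotic nature of the LR test; increasing n brings it back toward the nominal level.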

Highlights

  • The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models

  • Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies

  • The cause of the inconsistency puzzle between the statistical significance tests of nested models and area under the ROC curve (AUC) tests revealed in Vickers et al. [1] is that the DeLong et al. method was incorrectly applied to a set of data for which it was not intended


Discussion

We begin with a brief review of the basic elements for the assessment of statistical learning models for pattern classification. As shown by Demler et al. [8], under the multivariate normality assumption, equality of the two ideal AUCs for two nested linear discriminant functions is equivalent to the discriminant coefficients being zero for the variables not shared by the two models. This means that the LR test, the Wald test, and the F test all test the same null hypothesis with different test statistics. The DeLong et al. method is designed to test the null hypothesis that the true performances of two fixed models are equal: A^(1)_{r1} = A^(2)_{r2}, where the superscripts 1, 2 denote the two models under comparison, and the subscripts indicate that the training data sets may be different. In each Monte Carlo (MC) trial, we drew a sample of test scores from the designated distribution and applied the DeLong et al. method [3] and the U-statistics-based method [9] to compare the AUC values. We summarize the pros and cons of the different paradigms and statistical tests below.
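A paired DeLong-style comparison of two fixed scores on a common test set can be sketched as follows. This is a minimal numpy illustration of the kind of test described in [3], under the assumption of paired scores on the same cases; the function names and data layout are ours, and degenerate cases (e.g., identical scores, which make the variance zero) are not handled.

```python
import numpy as np
from math import erfc, sqrt

def auc_and_components(pos, neg):
    """Empirical AUC plus DeLong structural components for one score."""
    # psi(X_i, Y_j) = 1 if X_i > Y_j, 0.5 on ties, 0 otherwise
    psi = (pos[:, None] > neg[None, :]).astype(float)
    psi += 0.5 * (pos[:, None] == neg[None, :])
    auc = psi.mean()
    v10 = psi.mean(axis=1)   # one component per diseased (positive) case
    v01 = psi.mean(axis=0)   # one component per non-diseased (negative) case
    return auc, v10, v01

def delong_paired_test(y, s1, s2):
    """Two-sided test of equal AUCs for two fixed scores on the same test cases."""
    y = np.asarray(y, bool)
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    a1, v10_1, v01_1 = auc_and_components(s1[y], s1[~y])
    a2, v10_2, v01_2 = auc_and_components(s2[y], s2[~y])
    m, n = int(y.sum()), int((~y).sum())
    s10 = np.cov(np.vstack([v10_1, v10_2]))   # 2x2 covariance across positives
    s01 = np.cov(np.vstack([v01_1, v01_2]))   # 2x2 covariance across negatives
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (a1 - a2) / sqrt(var)
    p = erfc(abs(z) / sqrt(2.0))              # two-sided normal p-value
    return a1, a2, z, p
```

Note that this procedure treats the two scores as fixed classifiers evaluated on an independent test set; it does not test whether added variables improve a model refit on the same data, which is the misapplication at the root of the inconsistency puzzle.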

