Abstract

New technologies have recently emerged that enable simultaneous evaluation of large numbers of biological markers. The resulting marker data are often used to build predictive models that claim to distinguish between two or more classes of subjects. However, when there are many variables and few observations, the problem of overfitting arises: the model parameters are optimized for the observed data but may fit independent data poorly. Here we illustrate how various quantities related to true and apparent predictive ability scale with the number of markers and the number of observations (subjects). Specifically, we use a model that takes the form of a linear combination of a subset of marker variables; the model produces a propensity score, which generates an ROC curve and a corresponding area under the ROC curve (AUC), a measure of predictive ability. Given the true marker distributions, there is a parameter value for which the resulting predictive model gives the optimal true AUC. In practice, the true distributions are unknown, so experimental data are used to derive a parameter value that produces the optimal apparent AUC, where the “apparent” AUC is based on the observed rather than the true distributions. If the model with the estimated optimal parameter is then applied to an independent data set, it has an actual AUC determined by the estimated optimal parameter and the true marker distributions. The difference between the apparent AUC and the actual AUC can be defined as the total error in estimating predictive ability. This total error can be additively decomposed into the “overfitting error”, namely the apparent AUC minus the optimal AUC, and the “mis-specification error”, namely the optimal AUC minus the actual AUC.
We focus here on how these errors scale with the number of observations and the number of markers, where the latter are divided into “null” markers which contain no information as to class status and “associated” markers which are related to class status.
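
The decomposition described above can be illustrated with a minimal simulation sketch. It is not the authors' procedure, and everything in it is an assumption made for illustration: markers are independent unit-variance Gaussians, each associated marker has a hypothetical mean shift `delta` in cases, and the coefficient vector is estimated by a simple difference of class means rather than by maximizing the apparent AUC directly. Because of that last substitution, the overfitting error of a single run need not be positive. Under these assumptions the true AUC of any linear score has a closed form, so the optimal and actual AUC can be computed exactly. The function names (`simulate_errors`, `true_auc`, `empirical_auc`) are hypothetical.

```python
import math
import numpy as np


def normal_cdf(x):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def empirical_auc(case_scores, control_scores):
    """Apparent AUC: fraction of (case, control) pairs ranked correctly, ties count 1/2."""
    diff = case_scores[:, None] - control_scores[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def true_auc(w, mu):
    """True AUC of the score x @ w when controls ~ N(0, I) and cases ~ N(mu, I).

    The case-minus-control score difference is N(w @ mu, 2 * ||w||^2),
    so AUC = Phi(w @ mu / (sqrt(2) * ||w||)).
    """
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return 0.5  # a constant score carries no information
    return normal_cdf(float(w @ mu) / (math.sqrt(2.0) * norm))


def simulate_errors(n_per_class, n_associated, n_null, delta=0.5, seed=0):
    """One simulated study; returns the three AUCs and the error decomposition."""
    rng = np.random.default_rng(seed)
    # Assumed true mean shift: delta for associated markers, 0 for null markers.
    mu = np.concatenate([np.full(n_associated, delta), np.zeros(n_null)])
    controls = rng.standard_normal((n_per_class, mu.size))
    cases = rng.standard_normal((n_per_class, mu.size)) + mu

    # Assumed estimator: difference of sample means (stand-in for the
    # apparent-AUC-maximizing parameter described in the text).
    w_hat = cases.mean(axis=0) - controls.mean(axis=0)

    apparent = empirical_auc(cases @ w_hat, controls @ w_hat)  # observed data
    actual = true_auc(w_hat, mu)   # estimated parameter, true distributions
    optimal = true_auc(mu, mu)     # best linear score under the assumed model
    return {
        "apparent_auc": apparent,
        "actual_auc": actual,
        "optimal_auc": optimal,
        "overfitting_error": apparent - optimal,
        "misspecification_error": optimal - actual,
        "total_error": apparent - actual,
    }


if __name__ == "__main__":
    for n_null in (0, 10, 100):
        result = simulate_errors(n_per_class=20, n_associated=5, n_null=n_null)
        print(n_null, {k: round(v, 3) for k, v in result.items()})
```

Averaging `simulate_errors` over many seeds while varying `n_per_class` and `n_null` gives a rough picture of how the two error components scale in this simplified setting. The mis-specification error is non-negative by construction here, since no linear score can exceed the optimal AUC under the assumed model.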
