In clinical practice, surrogate variables are commonly used as indirect measures when the primary outcome variable X, on which disease status is assessed, is difficult or expensive to measure. In this article, we consider the problem of constructing an optimal binary surrogate Y to substitute for the feature variable X. To retain samples with rare values of X, the paired sample (X, Y) is usually selected by stratified sampling, where the strata are disjoint intervals partitioning the support of X. Under such a sampling design, the stratum proportions are usually unknown, so proportional allocation is infeasible and the pairs (X, Y) cannot be regarded as an i.i.d. sample across strata. We estimate the unknown cutoff separating higher from lower levels of X that optimally matches the variable Y, and we provide true positive rates (TPR) adjusted for the disproportionate stratum weights. Our approach is to estimate the underlying distribution of X and then conduct an ad hoc estimation of the TPR and of the expected prediction error under the zero-one loss function. We develop a parametric estimate of the distribution of X under an exponential family assumption, and a weighted kernel density estimator for the case where the distribution of X is unspecified. We illustrate our methods in several simulation studies and on a real example in which binary surrogates were evaluated for a medical device. The simulation results indicate that our approach performs well.
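The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes the stratum weights are already known (the abstract treats them as unknown and obtains them from an estimated distribution of X), it takes the TPR to mean P(Y = 1 | X > c), and the function names `weighted_error_and_tpr` and `best_cutoff` are hypothetical.

```python
import numpy as np

def weighted_error_and_tpr(x, y, stratum, weights, cutoff):
    """Stratum-weighted zero-one error and TPR for one cutoff.

    ASSUMPTION: `weights` maps each stratum label to its population
    proportion and is treated as known; this simplifies the setting
    in the abstract, where the proportions must be estimated.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    stratum = np.asarray(stratum)
    high = x > cutoff  # "higher level" of X at this cutoff
    err = p_high_and_pos = p_high = 0.0
    for h, w in weights.items():
        m = stratum == h
        # Within-stratum rates, reweighted to the population.
        err += w * np.mean(high[m] != (y[m] == 1))
        p_high_and_pos += w * np.mean(high[m] & (y[m] == 1))
        p_high += w * np.mean(high[m])
    # ASSUMPTION: TPR is taken as P(Y = 1 | X > cutoff).
    tpr = p_high_and_pos / p_high if p_high > 0 else float("nan")
    return err, tpr

def best_cutoff(x, y, stratum, weights, candidates):
    """Return the candidate cutoff with the smallest weighted error."""
    errs = [weighted_error_and_tpr(x, y, stratum, weights, c)[0]
            for c in candidates]
    return candidates[int(np.argmin(errs))]
```

With a rare upper stratum that is oversampled relative to its population share, the weighted TPR from this sketch can differ noticeably from the raw sample TPR, which is the kind of adjustment the abstract refers to.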