Two-phase study designs are ideal for focused sub-studies based on large prospective cohorts when the outcome of interest is an event that is rare in the full cohort and additional covariates are expensive or difficult to measure. Researchers often wish to examine large numbers of covariates for association with outcomes of interest. In the context of cancer, hundreds to millions of genetic markers may be considered, along with environmental exposures. A computationally efficient variable selection method is proposed for two-phase failure time studies with stratified sampling under the Cox proportional hazards model. The penalized estimator is obtained from a penalized (weighted) Cox log partial likelihood using a pathwise cyclical coordinate descent algorithm that scales to high-dimensional datasets in which the number of features is much larger than the sample size (p≫n). A detailed simulation study examining the performance of the proposed methodology is described. The variable selection and estimation procedure is then used to obtain a model for predicting acute myeloid leukaemia using somatic stem cell mutation profiles derived from blood samples, based on a two-phase sample from the European Prospective Investigation into Cancer and Nutrition (EPIC) study.
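To make the objective concrete, the following is a minimal sketch of a weighted Cox log partial likelihood with a penalty term, together with the soft-thresholding operator on which lasso-type coordinate updates typically rely. It is an illustration under stated assumptions, not the estimator as specified in the paper. The L1 penalty, the Breslow-style handling of ties, the use of inverse sampling probabilities as weights, and all function and argument names are assumptions introduced here.

```python
import math


def soft_threshold(z, gamma):
    """Soft-thresholding operator, the building block of lasso-type
    coordinate updates: shrink z towards zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def weighted_log_partial_likelihood(beta, X, time, event, weights):
    """Weighted Cox log partial likelihood (Breslow form, no tie correction).

    weights[i] is a hypothetical sampling weight, e.g. the inverse of the
    phase-two selection probability for the stratum of subject i.
    """
    # Linear predictor x_i' beta for every subject.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored subjects contribute only through risk sets
        # Weighted risk-set sum over subjects still at risk at time[i].
        risk = sum(
            weights[j] * math.exp(eta[j])
            for j in range(len(time))
            if time[j] >= time[i]
        )
        total += weights[i] * (eta[i] - math.log(risk))
    return total


def penalized_objective(beta, X, time, event, weights, lam):
    """Negative weighted log partial likelihood plus an L1 penalty.

    The L1 (lasso) penalty is an assumption made for illustration; minimizing
    this quantity over beta gives a penalized estimate for a given lam.
    """
    loss = -weighted_log_partial_likelihood(beta, X, time, event, weights)
    return loss + lam * sum(abs(b) for b in beta)
```

A pathwise algorithm would minimize `penalized_objective` over a decreasing grid of `lam` values, cycling through the coordinates of `beta` and warm-starting each fit from the previous solution. The sketch evaluates each risk set directly and so costs quadratic time in the sample size; a scalable implementation would sort by time and accumulate risk-set sums instead.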