Sample size planning for survival prediction with focus on high-dimensional data

Heiko Götte, Isabella Zwiener

doi:10.1002/sim.5550

Abstract

Sample size planning should reflect the primary objective of a trial. If the primary objective is prediction, the sample size determination should focus on prediction accuracy instead of power. We present formulas for the determination of training set sample size for survival prediction. Sample size is chosen to control the difference between optimal and expected prediction error. Prediction is carried out by Cox proportional hazards models. The general approach considers censoring as well as low-dimensional and high-dimensional explanatory variables. For dimension reduction in the high-dimensional setting, a variable selection step is inserted. If not all informative variables are included in the final model, the effect estimates are biased towards zero. The bias affects the prediction error, and its magnitude is influenced by the sample size. For variable selection, we consider two approaches: least absolute shrinkage and selection operator (LASSO) and univariable selection. For univariable selection, we can calculate input parameters for the sample size formula. For the LASSO, supportive simulations are necessary to appropriately choose the input parameters. We investigate the performance of the proposed formulas using simulations. Simulation results support the validity of the sample size formulas. An application to a real data example illustrates the practical implementation of the method.
