Data analysis with machine learning (ML) algorithms is gaining popularity in biomedical research. When time-to-event data are of interest, censoring is common and must be properly addressed. Most ML methods cannot conveniently and appropriately take censoring information into account, which can lead to inaccurate or biased results. We aim to develop a general-purpose method for imputing censored survival data that facilitates downstream ML analysis. To this end, we propose a novel method for imputing the survival times of censored observations based on their conditional survival distributions (CondiS), derived from Kaplan-Meier estimators. CondiS replaces censored observations with their best approximations from the statistical model, allowing ML methods to be applied directly. When covariates are available, we extend CondiS by incorporating the covariate information through ML modeling (CondiS-X), which further improves the accuracy of the imputed survival times. In extensive simulation studies, the proposed methods achieved smaller prediction errors and higher concordance with the underlying true survival times than existing methods with similar purposes. We also demonstrate the usage and advantages of the proposed methods on two real-world cancer datasets. The major advantage of CondiS is that standard ML techniques can be applied directly once the censored survival times are imputed. We present a user-friendly R package that implements our method, a useful tool for ML-based biomedical research in this era of big data.
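To make the idea of imputing from a conditional survival distribution concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the CondiS implementation or the API of the R package: it assumes the imputed value is the conditional mean E[T | T > c] under the Kaplan-Meier curve, restricted to the largest observed time, and the function names `kaplan_meier` and `impute_censored` are hypothetical.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return the distinct event times and the KM survival estimate at each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # subjects still under observation
        deaths = np.sum((times == t) & events)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def impute_censored(times, events):
    """Replace each censored time c by c + (integral of S from c to tau) / S(c).

    Assumption: tau is the largest observed time, so this is a restricted
    conditional mean. The published method may define the imputation differently.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times, surv = kaplan_meier(times, events)
    tau = times.max()

    def S(t):
        # Right-continuous step function: value after the last event time <= t.
        idx = np.searchsorted(event_times, t, side="right")
        return 1.0 if idx == 0 else surv[idx - 1]

    imputed = times.copy()
    for i in np.where(~events)[0]:
        c = times[i]
        s_c = S(c)
        if s_c <= 0.0 or c >= tau:
            continue  # nothing to integrate beyond c; keep the censoring time
        inner = event_times[(event_times > c) & (event_times < tau)]
        grid = np.concatenate(([c], inner, [tau]))
        heights = np.array([S(g) for g in grid[:-1]])
        imputed[i] = c + np.sum(heights * np.diff(grid)) / s_c
    return imputed


# Toy data: the second subject is censored at time 2 (event indicator 0).
toy_times = [1, 2, 3, 4, 5]
toy_events = [1, 0, 1, 1, 1]
toy_imputed = impute_censored(toy_times, toy_events)
```

In this toy example the subject censored at time 2 receives the average of the later event times 3, 4, and 5, which is what a conditional mean under the Kaplan-Meier curve yields when no other observations are censored. The imputed vector can then be passed to any standard regression learner as an ordinary numeric outcome.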