<h3>Purpose/Objective(s)</h3> Cancer registries are important sources of real-world data, capturing a large number of complex data elements such as cancer stage and treatments, but the prevalence of missing data is often high. Previous research suggests that missing data can identify patients within cancer registries who have worse survival outcomes, potentially introducing substantial bias into studies that use complete case analysis. Recent computational advances have enabled the application of machine learning (ML) imputation methods to large real-world datasets; however, the efficacy of these approaches for cancer patients is unknown. <h3>Materials/Methods</h3> We queried the National Cancer Database for non-small cell lung cancer (NSCLC) patients diagnosed in 2014 with complete data in 19 variables of known clinical and prognostic significance. Complete records were chosen because a reference value is needed to compare the efficacy of imputation techniques. We performed data preprocessing and generated synthetic missing data in 10% to 50% of records at random for each variable, then performed imputation using substitution (control) and five different ML approaches: Bayesian ridge regression under a multivariate imputation by chained equations (MICE) framework, k-nearest neighbors (KNN), matrix completion by spectral regularization (SoftImpute), iterative random forests (MissForest), and denoising autoencoders (DA). Imputation efficacy was measured by normalized root-mean-square error (RMSE) for continuous variables and proportion of falsely classified entries (PFC) for categorical variables. Algorithm runtimes were measured using a cloud computing instance with 16 virtual processors and 42 gigabytes of memory. <h3>Results</h3> We identified 50,790 NSCLC patients with complete data, each with 81 features after data preprocessing. Mean substitution for continuous variables had an RMSE of 0.091, and mode substitution for categorical variables had a PFC of 0.406. 
In comparison, among the tested ML methods, MICE had the lowest RMSE (best performance) for continuous variables, ranging from 0.069 to 0.077 for 10% to 50% missing data, and MissForest had the lowest PFC (best performance) for categorical variables, ranging from 0.251 to 0.311 for 10% to 50% missing data. Runtimes ranged from 118.9 to 267.9 seconds for MICE and from 112.0 to 186.8 seconds for MissForest. KNN and DA had higher runtimes despite lower performance, while substitution runtimes were under 0.1 second for all levels of missing data. <h3>Conclusion</h3> ML methods achieved promising levels of imputation efficacy with acceptable computing runtimes for NSCLC patients within a large national cancer registry. These approaches can potentially improve clinical insights from registry data for NSCLC patients by enabling more complete cohorts that incorporate ML-imputed information.
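The evaluation design described in the Materials/Methods can be illustrated with a minimal sketch: mask a fraction of complete records at random, impute by substitution (the control), and score the imputed entries against the known true values. This is not the study's code. The data are synthetic, the variable names `age` and `stage` are hypothetical, and the RMSE normalization (dividing errors by the true column's range) is an assumption, since the abstract does not state which normalization was used.

```python
import numpy as np

def mask_at_random(n_rows, frac, rng):
    """Boolean mask marking a fraction `frac` of rows as missing, uniformly at random."""
    mask = np.zeros(n_rows, dtype=bool)
    n_missing = int(round(frac * n_rows))
    mask[rng.choice(n_rows, size=n_missing, replace=False)] = True
    return mask

def normalized_rmse(true, imputed, mask):
    """RMSE over the masked entries, with errors divided by the true column's range.
    The range-based normalization is an assumption, not taken from the abstract."""
    span = true.max() - true.min()
    err = (true[mask] - imputed[mask]) / span
    return float(np.sqrt(np.mean(err ** 2)))

def pfc(true, imputed, mask):
    """Proportion of falsely classified entries among the masked entries."""
    return float(np.mean(true[mask] != imputed[mask]))

def mean_substitute(col, mask):
    """Fill masked entries with the mean of the observed entries."""
    out = col.astype(float).copy()
    out[mask] = col[~mask].mean()
    return out

def mode_substitute(col, mask):
    """Fill masked entries with the most frequent observed category."""
    out = col.copy()
    values, counts = np.unique(col[~mask], return_counts=True)
    out[mask] = values[np.argmax(counts)]
    return out

# Synthetic stand-ins for one continuous and one categorical registry variable.
rng = np.random.default_rng(0)
n = 1000
age = rng.normal(68, 10, size=n)
stage = rng.choice(np.array(["I", "II", "III", "IV"]), size=n, p=[0.3, 0.1, 0.2, 0.4])

results = {}
for frac in (0.1, 0.3, 0.5):
    m_age = mask_at_random(n, frac, rng)
    m_stage = mask_at_random(n, frac, rng)
    results[frac] = (
        normalized_rmse(age, mean_substitute(age, m_age), m_age),
        pfc(stage, mode_substitute(stage, m_stage), m_stage),
    )
    print(f"missing={frac:.0%}  RMSE={results[frac][0]:.3f}  PFC={results[frac][1]:.3f}")
```

Replacing `mean_substitute` and `mode_substitute` with an ML imputer while keeping the same masks and metrics would give the kind of comparison the abstract reports. The numbers printed here come from synthetic data and are not comparable to the study's results.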