Comparison of Machine Learning Approaches for Missing Data Imputation Among Non-Small Cell Lung Cancer Patients

D.X. Yang, R. Khera, E. Chang, M. Joel, G. Janda, H.S.M. Park, S. Aneja

doi:10.1016/j.ijrobp.2021.07.264

Abstract

<h3>Purpose/Objective(s)</h3> Cancer registries are important sources of real-world data that capture a large number of complex data elements, such as cancer stage and treatments, but the prevalence of missing data is often high. Previous research suggests that missing data can identify patients within cancer registries with worse survival outcomes, potentially introducing substantial bias in studies that use complete case analysis. Recent computational advances have enabled the application of machine learning (ML) imputation methods to large real-world datasets; however, the efficacy of these approaches for cancer patients is unknown. <h3>Materials/Methods</h3> We queried the National Cancer Database for non-small cell lung cancer (NSCLC) patients diagnosed in 2014 with complete data in 19 variables of known clinical and prognostic significance. Complete records were chosen because a reference value is needed to compare the efficacy of imputation techniques. We performed data preprocessing and generated synthetic missing data at random in 10% to 50% of records for each variable, then performed imputation using substitution (control) and five ML approaches: Bayesian ridge regression under a multivariate imputation by chained equations (MICE) framework, k-nearest neighbors (KNN), matrix completion by spectral regularization (SoftImpute), iterative random forests (MissForest), and denoising autoencoders (DA). Imputation efficacy was measured by normalized root-mean-square error (RMSE) for continuous variables and proportion of falsely classified entries (PFC) for categorical variables. Algorithm runtimes were measured on a cloud computing instance with 16 virtual processors and 42 gigabytes of memory. <h3>Results</h3> We identified 50,790 NSCLC patients with complete data, each with 81 features after data preprocessing. Mean substitution for continuous variables had an RMSE of 0.091, and mode substitution for categorical variables had a PFC of 0.406.
In comparison, among the tested ML methods, MICE had the lowest RMSE (best performance) for continuous variables, ranging from 0.069 to 0.077 for 10% to 50% missing data, and MissForest had the lowest PFC (best performance) for categorical variables, ranging from 0.251 to 0.311 for 10% to 50% missing data. Runtimes ranged from 118.9 to 267.9 seconds for MICE and from 112.0 to 186.8 seconds for MissForest. KNN and DA had higher runtimes despite lower performance, while substitution runtimes were under 0.1 second for all levels of missing data. <h3>Conclusion</h3> ML methods achieved promising levels of imputation efficacy with acceptable computing runtimes for NSCLC patients within a large national cancer registry. These approaches can potentially improve clinical insights from registry data for NSCLC patients by enabling more complete cohorts that incorporate ML-imputed information.
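
The evaluation design described under Materials/Methods (mask known values at random, impute, then score against the held-back reference) can be illustrated with a minimal sketch of the substitution control arm. This is not the authors' code. The data are hypothetical random columns rather than registry records, the helper names (`mask_at_random`, `mean_substitute`, `mode_substitute`, `rmse`, `pfc`) are invented for illustration, and the abstract does not state how RMSE was normalized, so the sketch simply assumes the continuous variable is already scaled to [0, 1].

```python
import math
import random
from collections import Counter


def mask_at_random(values, fraction, rng):
    """Copy `values`, set `fraction` of entries to None, return (copy, masked indices)."""
    n_missing = round(len(values) * fraction)
    masked_idx = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def mean_substitute(column):
    """Fill missing entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def mode_substitute(column):
    """Fill missing entries with the most frequent observed category."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def rmse(truth, imputed, idx):
    """Root-mean-square error, computed over the masked positions only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


def pfc(truth, imputed, idx):
    """Proportion of falsely classified entries over the masked positions."""
    return sum(truth[i] != imputed[i] for i in idx) / len(idx)


rng = random.Random(0)
# Hypothetical complete "reference" columns (assumption: not real registry data).
age_scaled = [rng.random() for _ in range(1000)]  # continuous, assumed scaled to [0, 1]
stage = rng.choices(["I", "II", "III", "IV"], weights=[4, 1, 2, 3], k=1000)

for fraction in (0.1, 0.3, 0.5):
    age_masked, age_idx = mask_at_random(age_scaled, fraction, rng)
    stage_masked, stage_idx = mask_at_random(stage, fraction, rng)
    age_rmse = rmse(age_scaled, mean_substitute(age_masked), age_idx)
    stage_pfc = pfc(stage, mode_substitute(stage_masked), stage_idx)
    print(f"{fraction:.0%} missing: RMSE={age_rmse:.3f}, PFC={stage_pfc:.3f}")
```

Swapping `mean_substitute` or `mode_substitute` for another imputer while keeping the masking and scoring steps fixed gives the kind of like-for-like comparison the abstract reports. Both metrics are lower-is-better, and both are scored only on the entries that were artificially removed.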

International Journal of Radiation Oncology*Biology*Physics
