Background: The prevalence of missing data in the National Cancer Database (NCDB) has marked implications for clinical care and research. The objective of this study was to enhance the NCDB by decreasing rates of missingness and adding new variables using automated statistical methodology.

Methods: One health system’s NCDB data from 2011–2021 were linked to electronic health record (EHR) data. Variables with frequent missingness, and new clinically significant variables not yet included in the NCDB, were identified in structured and unstructured EHR data. The new variables included patient Eastern Cooperative Oncology Group (ECOG) score, specific chemotherapy regimen, American Society of Anesthesiologists Physical Status Classification (ASA class), and discrete surgical procedure. After automated incorporation of structured data from the EHR, a natural language processing tool incorporating rule-based algorithms was designed to extract additional variables from unstructured notes. Rates of missingness were compared between the original NCDB and the enhanced dataset, and example multivariable models were run to assess whether model performance changed with reduced missingness and the addition of a new clinically significant variable (chemotherapy regimen).

Results: A total of 6050 patients with NCDB records were linked to their EHR data. Prior to enhancement, rates of missingness for key variables ranged from 2.0% to 5.3%. Following dataset enhancement, missingness was significantly reduced, with relative reductions ranging from 31.9% to 68.0%. Of the new variables added, 1367 (22.6%) of 6050 patients gained an ECOG score, and 1099 (57.8%) of 1901 patients who received chemotherapy gained their chemotherapy regimen. Of 2989 patients who underwent surgery, 979 (32.8%) gained their procedure name and 621 (20.8%) gained ASA class. Comparison of the multivariable models demonstrated significant differences between the original NCDB and the enhanced dataset. Specifically, when the binary predictor for chemotherapy in the original NCDB data was replaced with discrete regimens, the effect of ethnicity diminished and the effect of radiation became significant.

Discussion: We applied statistical methodology to reduce rates of missingness in existing variables and add new variables to enrich the NCDB. While further refinement is needed to decrease missingness in new variables, this automated methodology can replace or augment manual chart review and improve the ability to use the NCDB to study unanswered questions, leading to clinical advancements in oncology.
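
To make the two quantitative ideas above concrete, the following is a minimal Python sketch of (a) a rule-based extraction of an ECOG score from free-text notes and (b) the relative reduction in missingness, computed as (missing before − missing after) / missing before. It is not the study's tool: the regular expression, the function names (`extract_ecog`, `fill_missing_ecog`, `relative_reduction`), the record layout, and the toy notes are all assumptions introduced purely for illustration.

```python
import re
from typing import Optional

# Hypothetical rule (an assumption, not the study's algorithm): the token
# "ECOG", optionally followed by "PS", "performance status", or "score",
# an optional connector ("of", "is", ":" or "="), then a single digit 0-5.
ECOG_PATTERN = re.compile(
    r"\bECOG\b(?:\s+(?:PS|performance\s+status|score))?\s*(?:of|is|[:=])?\s*([0-5])\b",
    re.IGNORECASE,
)


def extract_ecog(note: str) -> Optional[int]:
    """Return the first ECOG score found in a note, or None if no rule fires."""
    match = ECOG_PATTERN.search(note)
    return int(match.group(1)) if match else None


def fill_missing_ecog(records: list[dict]) -> list[dict]:
    """Copy the records, filling a missing 'ecog' value from the 'note' text."""
    enhanced = []
    for record in records:
        updated = dict(record)
        if updated.get("ecog") is None:
            updated["ecog"] = extract_ecog(updated.get("note", ""))
        enhanced.append(updated)
    return enhanced


def count_missing(records: list[dict], field: str) -> int:
    """Count records whose value for `field` is absent or None."""
    return sum(1 for record in records if record.get(field) is None)


def relative_reduction(missing_before: int, missing_after: int) -> float:
    """Relative reduction in missingness: (before - after) / before."""
    if missing_before <= 0:
        raise ValueError("missing_before must be positive")
    return (missing_before - missing_after) / missing_before


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy_records = [
        {"ecog": 1, "note": ""},
        {"ecog": None, "note": "Pt seen in clinic. ECOG performance status: 1."},
        {"ecog": None, "note": "ECOG PS of 0, tolerating therapy."},
        {"ecog": None, "note": "No performance status documented."},
    ]
    before = count_missing(toy_records, "ecog")
    after = count_missing(fill_missing_ecog(toy_records), "ecog")
    print(before, after, relative_reduction(before, after))
```

A production rule set would need many more patterns (negation, ranges such as "ECOG 0-1", and alternative scales such as Karnofsky) and validation against manual chart review; the sketch only shows the overall shape of the approach.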