This study uses electronic health record (EHR) data to predict 12 common cancer symptoms, assessing the efficacy of machine learning (ML) models in identifying symptom influencers. We analyzed EHR data of 8156 adults diagnosed with cancer who underwent cancer treatment from 2017 to 2020. Structured and unstructured EHR data were sourced from the Enterprise Data Warehouse for Research at the University of Iowa Hospital and Clinics. Several predictive models, including logistic regression, random forest (RF), and XGBoost, were employed to forecast symptom development. The performances of the models were evaluated by F1-score and area under the curve (AUC) on the testing set. The SHapley Additive exPlanations framework was used to interpret these models and identify the predictive risk factors associated with fatigue as an exemplar. The RF model exhibited superior performance with a macro average AUC of 0.755 and an F1-score of 0.729 in predicting a range of cancer-related symptoms. For instance, the RF model achieved an AUC of 0.954 and an F1-score of 0.914 for pain prediction. Key predictive factors identified included clinical history, cancer characteristics, treatment modalities, and patient demographics depending on the symptom. For example, the odds ratio (OR) for fatigue was significantly influenced by allergy (OR = 2.3, 95% CI: 1.8-2.9) and colitis (OR = 1.9, 95% CI: 1.5-2.4). Our research emphasizes the critical integration of multimorbidity and patient characteristics in modeling cancer symptoms, revealing the considerable influence of chronic conditions beyond cancer itself. We highlight the potential of ML for predicting cancer symptoms, suggesting a pathway for integrating such models into clinical systems to enhance personalized care and symptom management.
Read full abstract