Using machine learning on real-world data to predict metastatic status.

Foad H Green, Mary Tran, Hu T Huang, Joshua Loving, Michelle Lerman, Matthew J Rioth, Vinod Subramanian

doi:10.1200/jco.2022.40.16_suppl.1550

Abstract

1550 Background: Real-world data (RWD) are increasingly used to inform research, patient care, and population health in oncology; however, using RWD at scale requires accurate methods to identify clinically relevant attributes. Metastatic status is a highly relevant clinical attribute in cancer patients, but it is not routinely captured in structured formats, and its determination conventionally requires review and interpretation by certified tumor registrars (CTRs). Clinical diagnoses, treatments, imaging procedures, and other clinical variables documented in electronic health records (EHRs) can be used to differentiate metastatic from non-metastatic patients. This study describes an effective machine learning approach that uses prevalent, standardized data elements from EHRs across multiple health systems.
Methods: A total of 28,043 lung cancer and breast cancer patients from two large health systems within the Syapse Learning Health Network, with data sourced from CTR abstraction and EHRs, were analyzed. Patients were labeled for reference metastatic status by CTRs and split into training (n = 22,434) and testing (n = 5,609) cohorts, with proportionate distribution of cancer type and metastatic status between cohorts. A regularized gradient boosting algorithm, XGBoost, was trained using over 750 variables from the patient records collected at the time of or after the initial cancer diagnosis.
Results: Integrating ICD-10-CM codes with antineoplastic treatment history and radiologic imaging procedure orders improved metastatic status prediction, increasing precision and recall in lung cancer (21% and 32%, respectively) and breast cancer (39% and 9%, respectively) compared with using only ICD-10-CM diagnosis codes for secondary malignant neoplasms (Table). Adding treatment and procedure data from different cancer types improved model classification within individual cancer types.
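The cohort split and the evaluation metrics described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the records are synthetic, the field names (`cancer_type`, `metastatic`, `has_secondary_code`) and the helper functions are hypothetical, and no model is trained. It only shows a roughly 80/20 split stratified by cancer type and metastatic status (the reported cohort sizes correspond to about 80% and 20% of 28,043) and how precision and recall would be computed for a baseline that flags a patient as metastatic only when a secondary-neoplasm diagnosis code is present.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records so every stratum keeps about the same train/test ratio."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for _, group in sorted(strata.items()):
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def make_toy_records():
    """Synthetic patients; counts and fields are invented for illustration."""
    counts = {
        ("lung", True): 20,
        ("lung", False): 30,
        ("breast", True): 10,
        ("breast", False): 40,
    }
    records = []
    for (cancer, metastatic), n in counts.items():
        for i in range(n):
            records.append({
                "cancer_type": cancer,
                "metastatic": metastatic,  # stands in for the CTR reference label
                # Assumed pattern: the code is present for only half of the
                # metastatic patients and never for non-metastatic ones.
                "has_secondary_code": metastatic and i % 2 == 0,
            })
    return records


records = make_toy_records()
train_set, test_set = stratified_split(
    records, key=lambda r: (r["cancer_type"], r["metastatic"])
)

# Baseline: predict metastatic only when the diagnosis-code flag is set.
baseline_precision, baseline_recall = precision_recall(
    [r["metastatic"] for r in records],
    [r["has_secondary_code"] for r in records],
)
print(len(train_set), len(test_set), baseline_precision, baseline_recall)
```

In a real setting, a trained classifier's predictions on the held-out cohort would replace the baseline flag in the second argument of `precision_recall`, allowing the same comparison against the code-only baseline that the Results report.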
Conclusions: One of the biggest challenges in using RWD for precision oncology is identifying clinically relevant phenotypes at scale. Here we demonstrate a scalable, evidence-based method that uses structured data from two separate health systems to impute metastatic status with high predictive power. With further validation, this approach may be generalized to other cancer types, applied to temporal slices of data to identify changes in metastatic status, and used to provide a high-confidence designation of metastatic status for other use cases such as staging. [Table: see text]
