Abstract Clinical data storage in unstructured notes and siloed datasets presents a major challenge for large-scale cancer informatics. Whether natural language processing (NLP) combined with multimodal integration across datasets can produce a mineable resource and improve discovery of relationships between tumor genomics and clinical phenotypes is unknown. We hypothesized that NLP could automatically annotate a pan-cancer corpus of 82,464 patients with tumor genomic sequencing. To develop algorithms to annotate free-text reports, we leveraged the AACR Project GENIE Biopharma Collaborative (BPC), a structured curation of electronic medical records (EMR) from five cancer types (non-small cell lung (NSCLC), breast, colorectal, prostate, and pancreatic cancer), to train and validate several Transformer and rule-based NLP models. After automating the generation of NLP annotations alongside medication, demographic, tumor registry, survival, and tumor genomic sequencing data, we tested whether clinicogenomic relationships not apparent in the smaller BPC cohort might be discoverable in the larger cohort. In 5-fold cross-validation, NLP Transformers accurately annotated the presence of cancer (AUC=0.99), cancer progression (AUC=0.97), and sites of disease (AUC=0.99) from radiology reports, and presence of prior outside treatment (AUC=0.98) and hormone receptor (HR) and HER2 receptor status (AUC=0.98 and 0.98, respectively) from clinician notes. In addition, rule-based models, trained on non-BPC data and validated on the whole BPC cohort, annotated smoking status from clinician notes (ACC=0.95), and Gleason score (ACC=1.0), PD-L1 status (ACC=0.98), and mismatch repair deficiency (ACC=0.98) from histopathology reports. NLP annotations were merged with genomic and other structured clinical data to create a Clinicogenomic, Harmonized Oncologic Real-world Dataset (MSK-CHORD). Finally, we tested whether associations not apparent in the BPC might be discoverable in MSK-CHORD.
We found positive associations between Gleason score and gene-level alterations in prostate cancer, including TP53, PTEN, and BRCA2 (q<0.1), none of which the BPC was adequately powered to detect. PD-L1 status was associated with better survival following immunotherapy treatment in NSCLC, but this association was statistically significant only in the larger MSK-CHORD. In breast cancer, NF1 mutations were associated with prior therapy in both cohorts, but this association was significant only in MSK-CHORD. The infrastructure generating MSK-CHORD uses a combination of on-premises and cloud computing resources and open-source development operations applications to automate processes. Once annotations are created, the data are imported into a local instance of cBioPortal, where researchers can visualize data and perform analyses. The system generating MSK-CHORD demonstrates how large-scale data delivery and integration can fuel cancer research.

Citation Format: Christopher J. Fong, Karl Pichotta, Thinh Tran, Michele Waters, Tom Fu, Mono Pirun, Mirella Altoe, Brooke Mastrogiacomo, Anisha Luthra, Mehnaj Ahmed, Arfath Pasha, Armaan Kohli, Raymond Lim, Tom Pollard, Darin Moore, Benjamin Gross, Avery Wang, Calla Chennault, Ritika Kundra, Ramyasree Madupuri, Ino de Bruijn, Aaron Lisman, Walid K. Chatila, Subhiksha Nandakumar, Anika Begum, Doori Rose, Kenneth L. Kehl, Deborah Schrag, Michael Berger, Jian Carrot-Zhang, Pedram Razavi, Bob Li, Peter Stetson, Nikolaus Schultz, Justin Jee. Systematic generation of a clinicogenomic harmonized oncologic real-world dataset (MSK-CHORD) [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3892.