Abstract 3892: Systematic generation of a clinicogenomic harmonized oncologic real-world dataset (MSK-CHORD)

Christopher J Fong, Calla Chennault, Avery Wang, Doori Rose, Karl Pichotta, Anika Begum, Mehnaj Ahmed, Deborah Schrag, Arfath Pasha, Subhiksha Nandakumar, Nikolaus Schultz, Thinh Tran, Tom Fu, Walid K Chatila, Benjamin Gross, Michele Waters, Kenneth L Kehl, Peter Stetson, Justin Jee, Ritika Kundra, Ino De Bruijn, Brooke Mastrogiacomo, Ramyasree Madupuri, Michael Berger, Tom Pollard, Darin Moore, Jian Carrot-Zhang, Raymond Lim, Mirella Altoe, Anisha Luthra, Mono Pirun, Armaan Kohli, Pedram Razavi, Bob Li, Aaron Lisman

doi:10.1158/1538-7445.am2024-3892

Abstract

Clinical data storage in unstructured notes and siloed datasets presents a major challenge for large-scale cancer informatics. Whether natural language processing (NLP) combined with multimodal integration across datasets can produce a mineable resource and improve discovery of relationships between tumor genomics and clinical phenotypes is unknown. We hypothesized that NLP could automatically annotate a pan-cancer corpus of 82,464 patients with tumor genomic sequencing. To develop algorithms to annotate free-text reports, we leveraged the AACR Project GENIE Biopharma Collaborative (BPC), a structured curation of EMR from five cancer types (non-small cell lung (NSCLC), breast, colorectal, prostate, and pancreatic cancer), to train and validate several Transformer and rule-based NLP models. After automating the generation of NLP annotations alongside medication, demographic, tumor registry, survival, and tumor genomic sequencing data, we tested whether clinicogenomic relationships not apparent in the smaller BPC cohort might be discoverable in the larger cohort. In 5-fold cross-validation, NLP Transformers accurately annotated the presence of cancer (AUC=0.99), cancer progression (AUC=0.97), and sites of disease (AUC=0.99) from radiology reports, and presence of prior outside treatment (AUC=0.98) and hormone receptor (HR) and HER2 receptor status (AUC=0.98, 0.98) from clinician notes. In addition, rule-based models, trained on non-BPC data and validated on the whole BPC cohort, annotated smoking status from clinician notes (ACC=0.95), and Gleason score (ACC=1.0), PD-L1 status (ACC=0.98), and mismatch repair deficiency (ACC=0.98) from histopathology reports. NLP annotations were merged with genomic and other structured clinical data to create a Clinicogenomic, Harmonized Oncologic Real-world Dataset (MSK-CHORD). Finally, we tested if associations not apparent in the BPC might be discoverable in MSK-CHORD.
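As an illustration of how a 5-fold cross-validated AUC of the kind reported above can be computed, here is a minimal pure-Python sketch. It is not the authors' pipeline: the function names (`roc_auc`, `k_fold_indices`, `cross_validated_auc`), the contiguous fold assignment, and the `fit` callback are all assumptions made for this example, and the inputs in the usage note are toy values, not study data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous, near-equal held-out folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_auc(features, labels, fit, k=5):
    """Mean held-out AUC over k folds.

    `fit(train_x, train_y)` is a hypothetical callback that trains a model on
    the training split and returns a function mapping one feature to a score.
    """
    aucs = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        score = fit(train_x, train_y)
        fold_labels = [labels[i] for i in test_idx]
        fold_scores = [score(features[i]) for i in test_idx]
        aucs.append(roc_auc(fold_labels, fold_scores))
    return sum(aucs) / len(aucs)
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because three of the four positive-negative pairs are ranked correctly. In practice, folds would normally be shuffled and stratified so that every fold contains both classes; the contiguous split here is only for brevity.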
We found positive associations between Gleason score and gene-level alterations in prostate cancer including TP53, PTEN and BRCA2 (q<0.1), none of which were adequately powered for detection in the BPC. We found PD-L1 status was associated with better survival following immunotherapy treatment in NSCLC, but only in the larger MSK-CHORD was this association statistically significant. In breast cancer, NF1 mutations were associated with prior therapy in both cohorts, but this association was only significant in MSK-CHORD. The infrastructure generating MSK-CHORD uses a combination of on-premise and cloud computing resources and open-source development operation applications to automate processes. Once annotations are created, data is imported into a local instance of cBioPortal, where researchers can visualize data and perform analyses. The system generating MSK-CHORD demonstrates how large-scale data delivery and integration can fuel cancer research.

Citation Format: Christopher J. Fong, Karl Pichotta, Thinh Tran, Michele Waters, Tom Fu, Mono Pirun, Mirella Altoe, Brooke Mastrogiacomo, Anisha Luthra, Mehnaj Ahmed, Arfath Pasha, Armaan Kohli, Raymond Lim, Tom Pollard, Darin Moore, Benjamin Gross, Avery Wang, Calla Chennault, Ritika Kundra, Ramyasree Madupuri, Ino de Bruijn, Aaron Lisman, Walid K. Chatila, Subhiksha Nandakumar, Anika Begum, Doori Rose, Kenneth L. Kehl, Deborah Schrag, Michael Berger, Jian Carrot-Zhang, Pedram Razavi, Bob Li, Peter Stetson, Nikolaus Schultz, Justin Jee. Systematic generation of a clinicogenomic harmonized oncologic real-world dataset (MSK-CHORD) [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3892.
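The abstract reports gene-level associations at q<0.1 but does not say how the q-values were derived. A common choice is the Benjamini-Hochberg false discovery rate adjustment, sketched below under that assumption. The function name `benjamini_hochberg` is hypothetical, and the p-values in the usage note are made-up illustrative numbers, not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order.

    Each p-value is scaled by m / rank, where rank is its 1-based position in
    ascending order, and a running minimum taken from the largest p-value
    downward keeps the q-values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def significant_at(names, p_values, threshold=0.1):
    """List the names whose q-value falls strictly below the threshold."""
    q_values = benjamini_hochberg(p_values)
    return [name for name, q in zip(names, q_values) if q < threshold]
```

With four hypothetical tests, `significant_at(["gene_A", "gene_B", "gene_C", "gene_D"], [0.01, 0.04, 0.03, 0.20])` keeps the first three names: their q-values are roughly 0.04, 0.053, and 0.053, while the fourth has q = 0.20. Because the adjustment depends on the number of tests and on the power behind each p-value, the same association can pass the threshold in a larger cohort and miss it in a smaller one.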

