Abstract

Introductory Statement: The goal is to use machine learning (ML) and large language models (LLMs) to augment the manual curation of cancer data elements.

Introduction: Memorial Sloan Kettering Cancer Center (MSKCC) has ~100,000 cancer patients with genomic testing, and the number is growing. Clinicians use genomic data for research but lack clinical data to analyze alongside it. We use a vendor, VASTA Global, to hire curators who manually curate cancer patients' core clinical data elements (CCDE) from unstructured paragraph text in electronic medical record (EMR) notes. The CCDE comprise 122 data elements covering a patient's full cancer history, which can take up to 1 working day to curate. We collaborated with the vendor Realyze Intelligence Healthcare Solutions to use their AI pipeline to generate the data that is otherwise curated manually. Realyze generated CCDE data elements such as histology, pathology site, MMR, TNM staging, ECOG, and KPS for a pilot lung cancer cohort of 150 patients. We manually validated the generated data for 74 of the 150 patients.

Methods: The Realyze platform uses a combination of LLMs, ML algorithms, and standard terminologies to create a cancer patient model. These models are flexible enough to address the unique needs and challenges of a pan-cancer oncology model. Using a standardized FHIR export, results were delivered to a data lake solution and written into a REDCap database to enable human review.

Summary: We manually assessed 74 patients. The NLP gave concordant values for MMR, KPS, and TNM staging in 100% of instances. For MMR, all of these were null values, with a false negative (FN) accuracy of 100%. Pathology site had 92.15% accuracy, while histology had 97.5% accuracy.

Conclusion: We will work on refining the ICD-O-3 list for pathology site and histology to increase accuracy. Once Realyze refines their model for these data elements, we will re-run it on a larger cohort of cancer patients and calculate the accuracy.
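The accuracy percentages in the Summary can be read as per-element concordance between generated and manually curated values. The following is a minimal Python sketch of that calculation, not the authors' or Realyze's actual method. The function name `element_accuracy`, the record layout, and the toy patient values are all hypothetical, and treating a null generated value that matches a null manual value as concordant is an assumption suggested by the MMR description.

```python
from typing import Dict, Optional

# Hypothetical record layout: data element name -> curated value (None = null).
Record = Dict[str, Optional[str]]


def element_accuracy(
    manual: Dict[str, Record], generated: Dict[str, Record]
) -> Dict[str, float]:
    """Percent of manually assessed values that the generated data matches,
    computed separately for each data element."""
    matches: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for patient_id, manual_record in manual.items():
        generated_record = generated.get(patient_id, {})
        for element, manual_value in manual_record.items():
            totals[element] = totals.get(element, 0) + 1
            # Assumption: null vs. null counts as concordant.
            if generated_record.get(element) == manual_value:
                matches[element] = matches.get(element, 0) + 1
    return {
        element: round(100.0 * matches.get(element, 0) / total, 2)
        for element, total in totals.items()
    }


# Toy data for illustration only; these are not values from the study.
manual = {
    "P1": {"ECOG": "1", "MMR": None},
    "P2": {"ECOG": "0", "MMR": None},
    "P3": {"ECOG": "2", "MMR": None},
    "P4": {"ECOG": "1", "MMR": None},
}
generated = {
    "P1": {"ECOG": "1", "MMR": None},
    "P2": {"ECOG": "0", "MMR": None},
    "P3": {"ECOG": "1", "MMR": None},  # one discordant ECOG value
    "P4": {"ECOG": "1", "MMR": None},
}

result = element_accuracy(manual, generated)
print(result)
```

In this toy example, three of four ECOG values agree and all MMR values are null in both sources, so the sketch reports 75.0 for ECOG and 100.0 for MMR. Whether the study computed accuracy over all 74 patients or only over patients with a recorded value for each element is not stated in the abstract.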
Accuracy Results

Clinical data elements (74 patients assessed)    Accuracy %
ECOG                                             98.6
KPS                                              100
T (path)                                         100
T (clinical)                                     100
N (path)                                         100
N (clinical)                                     100
M (path)                                         100
M (clinical)                                     100
MMR                                              100
Histology (path)                                 97.5
Path site                                        92.15

Citation Format: Andrew Niederhausern, Nadia S. Bahadur, Gary Wallace, Gilan E. Saadawi, John Philip. Machine learning and large language model approach to pancancer data elements [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4966.
