Abstract

To accelerate cancer research that correlates biomarkers with clinical endpoints, methods are needed to ascertain outcomes from electronic health records at scale. Here, we train deep natural language processing (NLP) models to extract outcomes for participants with any of 7 solid tumors in a precision oncology study. Outcomes are extracted from 305,151 imaging reports for 13,130 patients and 233,517 oncologist notes for 13,511 patients, including patients with 6 additional cancer types. NLP models recapitulate outcome annotation from these documents, including the presence of cancer, progression/worsening, response/improvement, and metastases, with excellent discrimination (AUROC > 0.90). Models generalize to cancers excluded from training and yield outcomes correlated with survival. Among patients receiving checkpoint inhibitors, we confirm that high tumor mutation burden is associated with superior progression-free survival ascertained using NLP. Here, we show that deep NLP can accelerate annotation of molecular cancer datasets with clinically meaningful endpoints to facilitate discovery.

Highlights

  • To accelerate cancer research that correlates biomarkers with clinical endpoints, methods are needed to ascertain outcomes from electronic health records at scale

  • We identified patients with any of 13 common malignant solid tumors whose tumor specimens underwent next-generation sequencing (NGS) through the PROFILE initiative at Dana-Farber Cancer Institute (DFCI) from 2013 to 2021[4,10]

  • To demonstrate an application of a clinico-genomic dataset in which clinical outcomes were defined using our natural language processing (NLP) models, we examined the association between progression-free survival (PFS) and tumor mutation burden (TMB), which has previously been characterized as a predictive biomarker for patients receiving immunotherapy[15,16,17], among 1,374 patients who received 1,694 lines of palliative-intent systemic therapy

Introduction

To accelerate cancer research that correlates biomarkers with clinical endpoints, methods are needed to ascertain outcomes from electronic health records at scale. Modern cancer research increasingly focuses on precision oncology[1], seeking to identify prognostic and predictive biomarkers to guide drug discovery and selection of optimal therapies for individual patients. Pursuing this objective, for uncommon cancers or rare biomarker patterns for common malignancies, requires large datasets of tumors that have undergone deep molecular characterization. We previously demonstrated the feasibility of training interpretable natural language processing (NLP) models to extract outcomes from imaging reports[8] and medical oncologist notes[9] for patients with non-small cell lung cancer. The generalizability of this approach to other types of cancer and its application to create a linked clinico-genomic dataset have not been previously described. We create a large multi-cancer clinico-genomic dataset by applying this technique to EHR data at scale, and we demonstrate the utility of this type of dataset by exploring associations between tumor mutation burden and progression-free survival on immune checkpoint inhibitor therapy.
