Abstract

e18747

Background: Accurate longitudinal records of cancer treatment are vital for establishing primary endpoints such as outcome, as well as for investigating adverse events. However, many longitudinal therapeutic regimens are not well captured in structured electronic health records (EHRs). Recognizing them in unstructured data such as clinical notes is therefore critical to obtaining an accurate description of the real-world patient treatment journey. Here, we demonstrate a scalable approach to extracting high-quality longitudinal cancer treatments from lung cancer patients' clinical notes using a natural language processing (NLP) pipeline based on Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF).

Methods: The lung cancer (LC) cohort of 4,698 patients was curated from the Mount Sinai Healthcare system (2003-2020). Two domain experts developed a structured framework of entities and semantics that captured treatment and its temporality. The framework included therapy type (chemotherapy, targeted therapy, immunotherapy, etc.), status (on, off, hold, planned, etc.), and temporal reasoning entities and relations (admin_date, duration, etc.). We pre-annotated 149 FDA-approved cancer drugs and longitudinal timelines of treatment on the training corpus. An NLP pipeline was implemented with BiLSTM-CRF-based deep learning models, which were trained and then applied to the clinical notes of the LC cohort. A postprocessor was developed to post-coordinate and refine the output. We performed both cross-evaluation and independent evaluation to assess pipeline performance.

Results: We applied the NLP pipeline to 853,755 clinical notes and identified 1,155 distinct entities for 194 generic cancer drugs, including 74 chemotherapy drugs, 21 immunotherapy drugs, and 99 targeted therapy drugs. From the clinical notes, we identified chemotherapy, immunotherapy, or targeted therapy data for 3,509 patients in the LC cohort. Compared with only 2,395 patients with cancer treatments in the structured EHR, the pipeline identified cancer treatments from notes for an additional 2,303 patients who did not have any cancer treatment data available in the structured EHR. Our evaluation indicates that the longitudinal cancer drug recognition pipeline delivers strong performance (named entity recognition for drug and temporal entities: F1 = 95%; drug-temporal relation recognition: F1 = 90%).

Conclusions: We developed a high-performance BiLSTM-CRF-based NLP pipeline to recognize longitudinal cancer treatments. The pipeline recovers and encodes cancer treatments for nearly twice as many patients as the structured EHR alone. Our study indicates that deep NLP with temporal reasoning could substantially accelerate the extraction of treatment profiles at scale. The pipeline is adjustable and can be applied across different cancers.
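
The F1 figures reported for named entity recognition can be illustrated with a minimal sketch of span-level scoring. This is not the authors' evaluation code: the BIO tagging scheme, the exact-match criterion, the label names `DRUG` and `STATUS`, and the helper names `bio_to_spans` and `prf1` are all assumptions made for illustration (only `admin_date` appears in the abstract itself).

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collapse a BIO tag sequence into a set of labelled spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag with no matching B- is treated as a span start.
            start, label = i, tag[2:]
    return spans


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over labelled spans."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical six-token note fragment; tags are invented for illustration.
gold_tags = ["O", "B-DRUG", "I-DRUG", "O", "B-ADMIN_DATE", "O"]
pred_tags = ["O", "B-DRUG", "I-DRUG", "O", "O", "B-STATUS"]

gold_spans = bio_to_spans(gold_tags)
pred_spans = bio_to_spans(pred_tags)
print(prf1(gold_spans, pred_spans))  # (0.5, 0.5, 0.5)
```

In this toy case the drug span is matched, the administration date is missed, and a spurious status span is predicted, so precision, recall, and F1 are each 0.5. Relation-level F1 could be scored the same way over sets of (drug span, temporal span, relation type) triples, again under the assumption of exact matching.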
