Abstract

1556 Background: Identifying patients with a particular cancer and determining the date of that diagnosis from electronic health record (EHR) data is important for selecting real-world research cohorts and conducting downstream analyses. However, cancer diagnoses and their dates are often not accurately recorded in the EHR in structured form. We developed a unified deep learning model for identifying patients with non-small cell lung cancer (NSCLC) and their initial and advanced diagnosis date(s).

Methods: The study used a cohort of 52,834 patients with lung cancer ICD codes from the nationwide deidentified Flatiron Health EHR-derived database. For all patients in the cohort, abstractors used an in-house technology-enabled platform to identify an NSCLC diagnosis, advanced disease, and the relevant diagnosis date(s) via chart review. Advanced NSCLC (aNSCLC) was defined as stage IIIB or IV disease at diagnosis, or early-stage disease that recurred or progressed. The deep learning model was trained on 38,517 patients, with a separate 14,317-patient test cohort. The model input was a set of sentences containing keywords related to (a)NSCLC, extracted from a patient’s EHR documents. Each sentence was associated with a date, using the document timestamp or, if present, a date mentioned explicitly in the sentence. The sentences were processed by a GRU network, followed by an attentional network that integrated across sentences and output a prediction of whether the patient had been diagnosed with (a)NSCLC and, if so, the diagnosis date(s). We measured the sensitivity and positive predictive value (PPV) of extracting the presence of initial and advanced diagnoses in the test cohort. Among patients with both model-extracted and abstracted diagnosis dates, we also measured 30-day accuracy, defined as the proportion of patients whose dates match to within 30 days. Real-world overall survival (rwOS) for patients identified as advanced by abstraction vs. by the model was calculated using Kaplan-Meier methods (index date: abstracted vs. model-extracted advanced diagnosis date).

Results: The Table reports the sensitivity, PPV, and 30-day accuracy of the model-extracted diagnoses and dates. rwOS was similar using model-extracted aNSCLC diagnosis dates (median = 13.7 months) versus abstracted diagnosis dates (median = 13.3 months), a difference of 0.4 months (95% CI = [0.0, 0.8]).

Conclusions: Initial and advanced diagnoses of NSCLC and their dates can be accurately extracted from unstructured clinical text using a deep learning algorithm. This can further enable the use of EHR data for research on real-world treatment patterns and outcomes, and for other applications such as clinical trial matching. Future work should aim to understand the impact of model errors on downstream analyses. [Table: see text]
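The evaluation metrics described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy dates, and the choice to treat a difference of exactly 30 days as a match are assumptions made for illustration. A patient counts as "diagnosed" here when a date is present, with the abstracted label serving as the reference standard.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

DateOrNone = Optional[date]


def presence_metrics(
    abstracted: Sequence[DateOrNone], extracted: Sequence[DateOrNone]
) -> Tuple[float, float]:
    """Return (sensitivity, PPV) for detecting that a diagnosis is present.

    A non-None date means "diagnosis present". Abstraction is the reference.
    """
    pairs = list(zip(abstracted, extracted))
    tp = sum(a is not None and e is not None for a, e in pairs)
    fn = sum(a is not None and e is None for a, e in pairs)
    fp = sum(a is None and e is not None for a, e in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


def accuracy_within_days(
    abstracted: Sequence[DateOrNone],
    extracted: Sequence[DateOrNone],
    window_days: int = 30,
) -> float:
    """Proportion of patients with both dates whose dates differ by at most
    window_days (inclusive boundary is an assumption of this sketch)."""
    both = [(a, e) for a, e in zip(abstracted, extracted)
            if a is not None and e is not None]
    if not both:
        return float("nan")
    hits = sum(abs((e - a).days) <= window_days for a, e in both)
    return hits / len(both)


# Hypothetical toy data: one entry per patient, None = no diagnosis found.
toy_abstracted = [date(2015, 3, 1), date(2016, 7, 15), None,
                  date(2017, 1, 10), date(2018, 5, 5)]
toy_extracted = [date(2015, 3, 20), date(2016, 10, 1), date(2014, 2, 2),
                 None, date(2018, 5, 5)]

toy_sensitivity, toy_ppv = presence_metrics(toy_abstracted, toy_extracted)
toy_accuracy = accuracy_within_days(toy_abstracted, toy_extracted)
print(toy_sensitivity, toy_ppv, toy_accuracy)
```

Restricting the date accuracy to patients with both dates mirrors the abstract's definition, so missed or spurious diagnoses affect sensitivity and PPV rather than the 30-day accuracy.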
