Abstract

e13624

Background: Cancer staging is instrumental in driving clinical management and trial enrollment, but staging data are generally unreliable and unstructured in the electronic health record (EHR). Advances in natural language processing (NLP) may facilitate clinical staging and documentation [1], but challenges to real-world implementation include (1) automatically identifying appropriate patients and reports from the EHR and (2) developing an unbiased dataset for training and validation [2]. We describe our institution's novel approach to overcoming these barriers while building an in-house NLP pipeline for clinical tumor staging of non-small-cell lung cancer (NSCLC).

Methods: We identified patients by searching our EHR (Epic) for a molecular analysis test ordered specifically for pathological diagnoses of NSCLC at our institution. We used the test order date as the diagnosis proxy date (DPD). For each patient, we extracted imaging reports dated from 16 weeks before to 6 weeks after the DPD. To derive primary tumor size, we analyzed the CT Chest or PET/CT report closest to the DPD using an oncology-trained NLP text extraction and labeling tool (John Snow Labs). We cleaned all extracted tumor size entities and identified the largest measurement linked to the lungs. We compared primary tumor measurements from the NLP pipeline with those in a preexisting, manually compiled cancer registry (CNEXT), and we analyzed discrepancies through manual chart review.

Results: A total of 542 patients with a DPD between November 2016 and September 2023 were processed through the NLP pipeline. Of 443 patients with valid values in both the pipeline and CNEXT, 53% (234) were exact matches and 20% (90) were close matches (a nonzero difference of no more than 5 mm), yielding a 73% accuracy rate for values within 5 mm. When mismatched values were manually reviewed, several cases in CNEXT were found to have a DPD differing by more than 3 months and tumor sizes derived from external reports. When these cases were excluded, 320 of the remaining 349 patients had valid values in both the pipeline and the updated manual review. In this refined population, 66% (213) were exact matches and 15% (48) were close matches, yielding an 82% accuracy rate for values within 5 mm.

Conclusions: To our knowledge, this is the first report of a pathology-based method to automatically and reliably identify patients with NSCLC and their relevant imaging reports directly from the EHR. We used a prebuilt NLP tool to derive primary tumor sizes with relatively high accuracy and found that adding flags for timeline discrepancies and external reports can further improve validity. As we near completion of analogous pipelines for node and metastasis staging, we will develop methodology to identify subgroups of patients who can be clinically staged with near-perfect accuracy, ultimately aiming to substantially limit manual staging of uncomplicated cases.

References: 1. Puts 2023. 2. Wang 2022.
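
The report selection, size derivation, and registry comparison described in the Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the NLP tool has already produced cleaned, lung-linked tumor sizes in millimetres for each report, and every name in it (`Report`, `select_report`, `primary_tumor_size`, `classify`, `accuracy_within_tolerance`) is hypothetical. It also assumes that the 5 mm bound is inclusive and that ties in distance to the DPD are broken by input order, neither of which the abstract specifies.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Sequence, Tuple

WEEKS_BEFORE = 16          # imaging window before the DPD (from the abstract)
WEEKS_AFTER = 6            # imaging window after the DPD (from the abstract)
CLOSE_TOLERANCE_MM = 5.0   # "close match" bound; inclusive by assumption
ELIGIBLE_MODALITIES = {"CT Chest", "PET/CT"}


@dataclass(frozen=True)
class Report:
    """Hypothetical record for one imaging report after NLP extraction."""
    report_date: date
    modality: str
    lung_sizes_mm: Tuple[float, ...]  # cleaned size entities linked to the lungs


def select_report(reports: Sequence[Report], dpd: date) -> Optional[Report]:
    """Return the eligible report closest to the DPD, or None if none qualifies."""
    start = dpd - timedelta(weeks=WEEKS_BEFORE)
    end = dpd + timedelta(weeks=WEEKS_AFTER)
    eligible = [
        r for r in reports
        if r.modality in ELIGIBLE_MODALITIES and start <= r.report_date <= end
    ]
    if not eligible:
        return None
    # min() keeps the first report on ties (an assumption, not from the source).
    return min(eligible, key=lambda r: abs((r.report_date - dpd).days))


def primary_tumor_size(report: Optional[Report]) -> Optional[float]:
    """Largest lung-linked measurement in the selected report, if any."""
    if report is None or not report.lung_sizes_mm:
        return None
    return max(report.lung_sizes_mm)


def classify(pipeline_mm: Optional[float], registry_mm: Optional[float]) -> str:
    """Label a pipeline/registry pair as exact, close, mismatch, or invalid."""
    if pipeline_mm is None or registry_mm is None:
        return "invalid"
    diff = abs(pipeline_mm - registry_mm)
    if diff == 0:
        return "exact"
    if diff <= CLOSE_TOLERANCE_MM:
        return "close"
    return "mismatch"


def accuracy_within_tolerance(labels: List[str]) -> float:
    """Share of valid pairs that are exact or close matches."""
    valid = [label for label in labels if label != "invalid"]
    if not valid:
        return 0.0
    return sum(label in ("exact", "close") for label in valid) / len(valid)
```

Applying the same ratio to the counts reported above gives (234 + 90) / 443 ≈ 73% for the full comparison set and (213 + 48) / 320 ≈ 82% for the refined population.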

