Abstract PO-050: Identifying de novo stage IV breast cancer (DNIV) cases in Electronic Health Records (EHR) using natural language processing

Liwei Wang, Emmanuel Gabriel, Karthik Giridhar, James Jakub, Kimberly Corbin, Hongfang Liu, Sadia Choudhery, Feichen Shen, Brenda Ernst

doi:10.1158/1557-3265.adi21-po-050

Abstract

Background: De novo stage IV breast cancer (DNIV) accounts for 6%–10% of newly diagnosed breast cancer cases. Despite widespread mammography screening, its incidence is increasing in the United States, and survival has improved only modestly since the late 1970s. As patient data accumulate in EHRs, observational data sources offer a promising way to generate practice-based evidence. However, assembling a DNIV cohort from EHR data is challenging because key pathologic and staging information is stored in unstructured clinical narratives and is not available as structured data. In this study, we developed a rule-based algorithm to phenotype DNIV using natural language processing (NLP) techniques and applied it to our institutional EHR to extract potential DNIV cases.

Methods and Results: We defined DNIV as either (1) M1 disease identified at the time of initial presentation or (2) M1 disease identified within 4 months after definitive surgery. We first developed a reference list of DNIV cases verified by physician chart review of the EHR. We then refined the algorithm on a training dataset containing 51 positive and 38 negative reference cases and tested its performance on a test dataset containing 23 positive and 55 negative cases. The phenotyping algorithm used NLP to identify key data elements, namely stage IV breast cancer, definitive surgery, stage 0-III, recurrent breast cancer, and their associated dates. To identify DNIV cases, the algorithm integrated the temporal relations among these data elements in the following sequential steps: (1) Patients with a breast cancer diagnosis were identified using ICD-9 and ICD-10 codes. (2) Patients were classified as positive cases if NLP detected, at the time of diagnosis, explicit mentions that distinguish DNIV from recurrent metastatic breast cancer, such as “de novo stage IV” or “primary intact”. (3) Otherwise, patients with definitive surgery were selected if stage IV was detected within 5 years before or within 4 months after definitive surgery, along with patients who had stage IV but no definitive surgery. (4) Patients with stage 0-III detected within 5 years after stage IV were excluded. (5) Patients with recurrent breast cancer detected before or at the time of stage IV detection were excluded. The remaining patients were classified as positive cases. The algorithm achieved a precision of 70%, a recall of 87%, and an F1 score (the harmonic mean of precision and recall) of 77%. We then applied the algorithm to 10 million clinical documents from a cohort of 56,548 patients with breast cancer diagnosis codes who presented to our institution between 2004 and 2018 and had research authorization, and identified 1918 potential DNIV cases.

Conclusion: Our future work will focus on algorithm refinement. An algorithm-generated cohort could serve as a data source for further studies of outcomes related to DNIV and, ideally, as an automated data abstractor and staging process.

Citation Format: Liwei Wang, Karthik Giridhar, Kimberly Corbin, Brenda Ernst, Sadia Choudhery, Emmanuel Gabriel, Feichen Shen, Hongfang Liu, James Jakub. Identifying de novo stage IV breast cancer (DNIV) cases in Electronic Health Records (EHR) using natural language processing [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PO-050.
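The five-step selection logic described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Patient` record, its field names, and the function names are hypothetical, and the NLP extraction of concepts and dates is assumed to have already happened upstream. The sketch also makes choices the abstract does not specify: it uses the earliest stage IV and earliest surgery dates, approximates 5 years as 1825 days and 4 months as 122 days, and treats "after stage IV" as strictly later than the stage IV date. The small `f1_score` helper shows the harmonic-mean formula for F1.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List

# Assumption: calendar windows approximated in days.
FIVE_YEARS = timedelta(days=5 * 365)
FOUR_MONTHS = timedelta(days=122)


@dataclass
class Patient:
    """Hypothetical per-patient record of NLP-extracted elements and dates."""
    has_breast_cancer_code: bool = False   # step 1: ICD-9/ICD-10 code present
    explicit_dniv_mention: bool = False    # step 2: e.g. "de novo stage IV"
    stage_iv: List[date] = field(default_factory=list)
    definitive_surgery: List[date] = field(default_factory=list)
    stage_0_iii: List[date] = field(default_factory=list)
    recurrence: List[date] = field(default_factory=list)


def is_potential_dniv(p: Patient) -> bool:
    # Step 1: restrict to patients with a breast cancer diagnosis code.
    if not p.has_breast_cancer_code:
        return False
    # Step 2: an explicit DNIV mention is sufficient.
    if p.explicit_dniv_mention:
        return True
    # Step 3: stage IV is required; with surgery, it must fall in the window
    # from 5 years before to 4 months after surgery.
    if not p.stage_iv:
        return False
    iv = min(p.stage_iv)  # assumption: earliest stage IV mention
    if p.definitive_surgery:
        surgery = min(p.definitive_surgery)  # assumption: earliest surgery
        if not (surgery - FIVE_YEARS <= iv <= surgery + FOUR_MONTHS):
            return False
    # Step 4: exclude if stage 0-III appears within 5 years after stage IV.
    if any(iv < d <= iv + FIVE_YEARS for d in p.stage_0_iii):
        return False
    # Step 5: exclude if recurrence is detected before or at stage IV.
    if any(d <= iv for d in p.recurrence):
        return False
    return True


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the rounded values reported in the abstract, `f1_score(0.70, 0.87)` evaluates to about 0.776, close to the reported 77%; the small difference presumably reflects rounding of the reported precision and recall.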
