Abstract

Background: Real-world evidence (RWE) studies of surveillance patterns following lung cancer (LC) diagnosis can inform efforts to optimize surveillance recommendations and practice. One major obstacle in RWE studies of LC surveillance is that the indication for radiologic imaging, i.e., surveillance vs. other reasons (e.g., symptoms), is not recorded. To enable RWE studies of surveillance to detect second primary lung cancer among LC survivors, we developed a hybrid modelling approach that integrates structured data from electronic health records (EHRs) with natural language processing (NLP) of radiology reports to abstract computed tomography (CT) imaging indications in LC survivors.

Methods: We manually reviewed and abstracted CT imaging indications, i.e., surveillance vs. others (e.g., symptoms and metastatic disease follow-up), to create a gold standard from 200 randomly selected radiology reports among 1,952 LC patients who (i) were diagnosed in 2000-2017 at Stanford Health Care (SHC) and (ii) survived ≥5 years after diagnosis. We extracted medically relevant key-phrases using part-of-speech grammar and PageRank algorithms. Hierarchical clustering identified the following context-specific key-phrase clusters: “surveillance”, “stable”, “nodule”, “symptom”, and “metastasis”. The text-based radiology reports were vectorized to generate NLP features using phrase occurrence frequencies. The structured variables from EHRs included: (i) diagnosis of lung diseases or chest symptoms in the previous 6 months, (ii) ordering provider type (oncology vs. others [e.g., emergency and internal medicine]), and (iii) time from the previous CT (≥6 months). A hybrid model including both structured and NLP features was then fitted using logistic regression and validated using 10-fold cross-validation. The model’s performance was compared with that of models based solely on NLP or structured data.

Results: The dataset of 200 radiology reports included 141 LC survivors (49% White, 72% adenocarcinoma). The proposed hybrid model showed high discrimination (AUC: 0.92), outperforming models based solely on NLP (AUC: 0.88) or structured data (AUC: 0.87). The proposed model demonstrated higher sensitivity (SN: 0.73) and equal or higher specificity (SP: 0.96) compared with models based solely on NLP (SN: 0.53; SP: 0.96) or structured data (SN: 0.53; SP: 0.90). In the hybrid model, the following variables were associated with a higher likelihood that the given CT imaging indication is “surveillance”: (i) a longer time interval (≥6 months) from the previous CT (odds ratio [OR]: 4.63; p=0.01) and the key-phrases (ii) “nodule” (OR: 1.55; p=0.004) and (iii) “stable” (OR: 1.37; p=0.03). Conversely, the key-phrases “symptom” (OR: 0.17; p=0.02) and “metastasis” (OR: 0.26; p=0.02) were associated with a lower likelihood of surveillance.

Conclusion: A hybrid modelling approach combining text-based NLP and structured EHR data has the potential to abstract CT imaging indications for LC surveillance. Future directions include validation using other EHR systems and extension to larger datasets.

Citation Format: Aparajita Khan, Julie Wu, Eunji Choi, Anna Graber-Naidich, Solomon Henry, Heather A. Wakelee, Allison W. Kurian, Su-Ying Liang, Ann Leung, Curtis Langlotz, Leah M. Backhus, Summer S. Han. A hybrid modelling approach for abstracting CT imaging indications by integrating natural language processing from radiology reports with structured data from electronic health records. [abstract]. In: Proceedings of the AACR Special Conference: Precision Prevention, Early Detection, and Interception of Cancer; 2022 Nov 17-19; Austin, TX. Philadelphia (PA): AACR; Can Prev Res 2023;16(1 Suppl): Abstract nr P068.
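
Illustrative sketch (not the authors' implementation): the Methods describe turning each radiology report into phrase occurrence frequencies per key-phrase cluster and combining these with three structured EHR variables before fitting a logistic regression. The Python sketch below shows only how such a hybrid feature vector might be assembled. The abstract names the five clusters but not their member phrases, so the phrase lists, the function names, and the example report are all assumptions made for illustration.

```python
import re

# Hypothetical member phrases for the five clusters named in the abstract.
# The real clusters came from hierarchical clustering and are not published here.
CLUSTERS = {
    "surveillance": ["surveillance"],
    "stable": ["stable", "unchanged"],
    "nodule": ["nodule", "nodules"],
    "symptom": ["cough", "chest pain", "shortness of breath"],
    "metastasis": ["metastasis", "metastases", "metastatic"],
}


def nlp_features(report_text):
    """Return one occurrence count per cluster, in CLUSTERS order."""
    text = report_text.lower()
    counts = []
    for phrases in CLUSTERS.values():
        total = 0
        for phrase in phrases:
            # Word boundaries keep "nodule" from also matching inside "nodules".
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += len(re.findall(pattern, text))
        counts.append(total)
    return counts


def structured_features(recent_lung_dx, ordered_by_oncology, months_since_prev_ct):
    """Encode the three structured EHR variables as 0/1 indicators."""
    return [
        int(recent_lung_dx),             # lung disease or chest symptom dx in previous 6 months
        int(ordered_by_oncology),        # ordering provider type: oncology vs. others
        int(months_since_prev_ct >= 6),  # time from previous CT of at least 6 months
    ]


def hybrid_features(report_text, recent_lung_dx, ordered_by_oncology, months_since_prev_ct):
    """Concatenate NLP and structured features into one model input row."""
    return nlp_features(report_text) + structured_features(
        recent_lung_dx, ordered_by_oncology, months_since_prev_ct
    )


# Made-up example report, not taken from the study data.
report = "Surveillance CT. Stable 4 mm nodule in the right upper lobe. No new nodules."
row = hybrid_features(report, recent_lung_dx=False, ordered_by_oncology=True,
                      months_since_prev_ct=8)
print(row)  # [1, 1, 2, 0, 0, 0, 1, 1]
```

Rows built this way could be passed to any logistic regression implementation and evaluated with 10-fold cross-validation, as the Methods describe. The reported odds ratios correspond to the exponentiated coefficients of such a model.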

