Abstract

Electronic health records (EHRs) provide a low-cost means of accessing detailed longitudinal clinical data for large populations. A lung cancer cohort assembled from EHR data would be a powerful platform for clinical outcome studies. This study investigated whether a clinical cohort assembled from EHRs could be used in a lung cancer prognosis study.

In this cohort study, patients with lung cancer were identified among 76 643 patients with at least 1 lung cancer diagnostic code deposited in an EHR in the Mass General Brigham health care system from July 1988 to October 2018. Patients were identified via a semisupervised machine learning algorithm, for which clinical information was extracted from structured and unstructured data via natural language processing tools. Data completeness and accuracy were assessed by comparison with the Boston Lung Cancer Study and against criterion standard EHR review results. A prognostic model for non-small cell lung cancer (NSCLC) overall survival was further developed for clinical application. Data were analyzed from March 2019 through July 2020. Clinical data deposited in EHRs for cohort construction and variables of interest for the prognostic model were collected. The primary outcomes were the performance of the lung cancer classification model and the quality of the extracted variables; the secondary outcome was the performance of the prognostic model.

Among 76 643 patients with at least 1 lung cancer diagnostic code, 42 069 patients were identified as having lung cancer, with a positive predictive value of 94.4%. The study cohort consisted of 35 375 patients (16 613 men [47.0%] and 18 756 women [53.0%]; 30 140 White individuals [85.2%], 1040 Black individuals [2.9%], and 857 Asian individuals [2.4%]) after excluding patients with a history of lung cancer and those with less than 14 days of follow-up after initial diagnosis. The median (interquartile range) age at diagnosis was 66.7 (58.4-74.1) years. The areas under the receiver operating characteristic curve of the prognostic model for overall survival with NSCLC were 0.828 (95% CI, 0.815-0.842) for 1-year prediction, 0.825 (95% CI, 0.812-0.836) for 2-year prediction, 0.814 (95% CI, 0.800-0.826) for 3-year prediction, 0.814 (95% CI, 0.799-0.828) for 4-year prediction, and 0.812 (95% CI, 0.798-0.825) for 5-year prediction.

These findings suggest the feasibility of assembling a large-scale EHR-based lung cancer cohort with detailed longitudinal clinical measurements and that EHR data may be applied in cancer progression with a set of generalizable approaches.
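The two metrics reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names and the toy numbers are made up for illustration, and the AUC here is the plain rank-based version for a binary outcome, which ignores the censoring that a survival analysis at fixed time horizons would normally have to handle.

```python
from typing import Sequence


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """PPV = TP / (TP + FP): the share of flagged patients who truly have the disease."""
    flagged = true_pos + false_pos
    if flagged == 0:
        raise ValueError("no positive predictions to evaluate")
    return true_pos / flagged


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy chart-review result (hypothetical): 90 confirmed cases among 100 flagged.
toy_ppv = positive_predictive_value(true_pos=90, false_pos=10)

# Toy risk scores (hypothetical); label 1 = died within the horizon, 0 = survived.
toy_scores = [0.9, 0.4, 0.35, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)

print(f"PPV = {toy_ppv:.3f}, AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values of roughly 0.81 to 0.83 should be read.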

Highlights

  • Lung cancer has been the most commonly diagnosed cancer and leading cause of cancer-related deaths for several decades.[1]

  • The areas under the receiver operating characteristic curve of the prognostic model for overall survival with non–small cell lung cancer (NSCLC) were 0.828 for 1-year prediction, 0.825 for 2-year prediction, 0.814 for 3-year prediction, 0.814 for 4-year prediction, and 0.812 for 5-year prediction.

  • Machine Learning Algorithm Using Electronic Health Records to Identify Lung Cancer Cohort. These findings suggest the feasibility of assembling a large-scale electronic health record (EHR)-based lung cancer cohort with detailed longitudinal clinical measurements and that EHR data may be applied in cancer progression with a set of generalizable approaches.

Introduction

Not counting skin cancer, lung cancer has been the most commonly diagnosed cancer and the leading cause of cancer-related deaths for several decades.[1] Patients with lung cancer have different outcomes based on various clinical factors.[3,4,5,6] A 2020 study[7] using data from Surveillance, Epidemiology, and End Results (SEER) found a significant reduction in mortality for lung cancer from 2013 to 2016, which was potentially associated with incidence reduction along with treatment advances. A large cohort with adequate clinical information is necessary to identify stable and reliable prognostic variables and the factors associated with improved survival outcomes. Many data elements in EHRs are typically recorded as free text with different terms, which makes natural language processing (NLP) a requisite technology for accurate data extraction and classification.[10,11] In particular, many clinical variables, such as lung cancer status, are not explicitly represented in EHRs but can be inferred based on multiple data elements via machine learning algorithms.[12,13,14]
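To make the free-text problem concrete, the following is a minimal sketch of mapping variant spellings in a clinical note to a canonical histology label. It is an illustration only and not the NLP tooling used in the study: the patterns, the function name `extract_histology`, and the example notes are all hypothetical, and the sketch deliberately ignores negation ("no evidence of NSCLC"), context, and misspellings that a real pipeline would need to handle.

```python
import re
from typing import List

# Hypothetical patterns; a production system would rely on a curated vocabulary.
# "\u2013" is an en dash, since notes may write "non–small cell".
NSCLC_RE = re.compile(r"\b(non[-\u2013\s]?small[-\s]cell|nsclc)\b", re.IGNORECASE)

# The negative lookbehind keeps "small cell" inside "non-small cell" from
# being counted as small cell lung cancer.
SCLC_RE = re.compile(r"(?<!non[-\u2013\s])\b(small[-\s]cell|sclc)\b", re.IGNORECASE)


def extract_histology(note: str) -> List[str]:
    """Return the canonical histology labels mentioned in a free-text note."""
    found = []
    if NSCLC_RE.search(note):
        found.append("NSCLC")
    if SCLC_RE.search(note):
        found.append("SCLC")
    return found


# Made-up example notes showing different surface forms of the same concept.
notes = [
    "Biopsy consistent with non-small cell carcinoma.",
    "Hx of NSCLC, s/p lobectomy.",
    "Small cell lung cancer, extensive stage.",
    "CT chest without suspicious nodules.",
]
for note in notes:
    print(extract_histology(note), "<-", note)
```

Rule-based matching like this only normalizes terms. Inferring a variable such as lung cancer status, as the text notes, requires combining many such extracted elements in a trained model.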
