Abstract 2118: Using electronic medical records to assemble a lung cancer cohort for prognosis study

Qianyu Yuan, David Christiani, Tianxi Cai, Tianrun Cai, Chuan Hong

doi:10.1158/1538-7445.am2020-2118

Abstract

Background: Electronic medical records (EMRs) provide a low-cost means of accessing longitudinal data on large populations, with detailed information on diagnoses, clinical procedures, medications, and tests. A lung cancer cohort assembled from EMRs represents a powerful resource for studying prognosis. Methods: EMRs from Massachusetts General Hospital and Brigham and Women's Hospital were obtained through the Partners HealthCare System Research Patient Data Registry, covering 1988 to October 18, 2018. A previously validated phenotyping algorithm, developed on a gold-standard set of 200 patients, was applied to identify lung cancer patients. Demographics and clinical characteristics were extracted from both structured data and clinical notes using natural language processing tools. The quality of the EMR cohort was assessed by comparing it with the Boston Lung Cancer Study (BLCS) cohort in the overlapping population. Specifically, we assessed: 1) absolute differences: the agreement of diagnosis dates, histological type, and clinical stage; 2) relative differences: the effect sizes of variables on overall survival estimated from Cox regression models. Results: The initial study population included 76,643 patients with a diagnostic code related to primary lung cancer. The phenotyping model identified 42,069 lung cancer patients with a sensitivity of 75.2%, specificity of 90.0%, PPV of 94.4%, and AUC of 0.927. A total of 5,053 patients present in both the BLCS and EMR cohorts were used for quality assessment. Diagnosis dates from the EMR agreed with those from the BLCS cohort, with a median difference of 0 days. Absolute date discrepancies of more than 90 days, 180 days, and one year occurred in 10.9%, 7.5%, and 5.1% of the population, respectively. Agreement on histological type and clinical stage was 90.1% and 82.8%, respectively. Further analysis of overall survival showed high consistency between the two cohorts: Cox regression models adjusted for age, sex, race, smoking status, histological type, stage, and treatment yielded similar estimates. Conclusion: We assembled a large lung cancer cohort from EMRs using a phenotyping algorithm and extraction strategies that combine structured and unstructured data. The quality of the analytic data from EMRs was compared against a well-curated epidemiology study to ensure its suitability for lung cancer prognosis research.

Citation Format: Qianyu Yuan, Tianrun Cai, Chuan Hong, Tianxi Cai, David Christiani. Using electronic medical records to assemble a lung cancer cohort for prognosis study [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2118.
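The diagnosis-date agreement reported in the Results (a median difference plus the share of patients whose dates disagree by more than 90 days, 180 days, and one year) could be computed along the following lines. This is only a minimal sketch, not the authors' code: the function name `date_agreement`, the use of a signed EMR-minus-reference difference for the median, the treatment of one year as 365 days, and the toy dates are all assumptions made for illustration.

```python
from datetime import date
from statistics import median


def date_agreement(pairs, thresholds=(90, 180, 365)):
    """Summarize agreement between paired diagnosis dates.

    pairs: iterable of (emr_date, reference_date) tuples.
    Returns the median signed difference in days (EMR minus reference)
    and, for each threshold, the fraction of pairs whose absolute
    difference exceeds that many days.
    """
    diffs = [(emr - ref).days for emr, ref in pairs]
    if not diffs:
        raise ValueError("no date pairs supplied")
    exceed = {
        t: sum(abs(d) > t for d in diffs) / len(diffs)
        for t in thresholds
    }
    return median(diffs), exceed


# Hypothetical toy data (not from the study): (EMR date, reference date).
toy_pairs = [
    (date(2010, 3, 1), date(2010, 3, 1)),
    (date(2012, 6, 15), date(2012, 6, 10)),
    (date(2015, 1, 1), date(2014, 8, 1)),
    (date(2018, 5, 1), date(2016, 5, 1)),
    (date(2011, 9, 9), date(2011, 9, 12)),
]

median_diff, exceed = date_agreement(toy_pairs)
```

Whether the abstract's median refers to signed or absolute differences is not stated; swapping `median(diffs)` for `median(abs(d) for d in diffs)` would give the absolute version.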
