Abstract
1559 Background: Although smoking is a major risk factor for lung cancer, collecting patients' smoking history accurately from the electronic health record (EHR) has been a challenge because the data are inconsistent and incomplete. To address these challenges, we used a weak supervision methodology to automatically annotate the smoking status of patients with lung cancer and correlated it with tumor characteristics. Methods: We analyzed 6,355 patients with lung cancer who underwent tumor profiling with MSK-IMPACT. In total, 14,555 unstructured clinical notes were extracted from the EHR at Memorial Sloan Kettering Cancer Center. The weak supervision methodology used a generative model to produce intermediate labels, which were subsequently tuned by a machine-learning classifier to generate the final labels. Clinical notes from a randomly sampled set of 564 patients were manually curated and used for performance assessment. The remaining patients were split into training and validation datasets used for model training and hyperparameter tuning. Pack-years were also extracted from clinical notes using natural language processing. We next conducted multivariate analyses, separately for primary and metastatic tumor samples, to correlate smoking metrics with tumor characteristics, including tumor mutation burden (TMB) and chromosomal instability, as inferred by the fraction of genome altered (FGA), after controlling for age at sequencing, gender, histological subtype, ancestry, coverage, and tumor purity. Results: The weak supervision classifier had near-perfect performance for the 2-label classification model (ever smokers and never smokers), with a macro F1-score of 97.7%; balanced accuracy: 97.1%, 97.1%; precision: 98.4%, 98.4%; and recall: 99.5%, 94.6%, respectively. For the 3-label classification model (never smoker, former smoker, and current smoker), the macro F1-score was 79.8%; balanced accuracy: 97.1%, 86.7%, 71.2%; precision: 93.9%, 90.1%, 61.7%; and recall: 96.1%, 93.3%, 46.0%, respectively.
In the genomic analysis, we observed that smoking status (smoker vs. never smoker) and pack-years were associated with TMB in both primary and metastatic tumor samples (p<2e-16). FGA was marginally associated with smoking status (smokers vs. never smokers) in primary tumor samples (p=0.06). Among smokers diagnosed with lung adenocarcinoma, FGA in primary tumor samples was significantly higher in males than in females after adjusting for pack-years and other variables (p=3.3e-3). Conclusions: We demonstrated high performance of our approach for automated curation of smoking history from the EHR. The genomic results confirmed distinct mutational patterns associated with smoking behavior in patients with lung cancer. We are currently exploring multimodal approaches that include chest CT images and "time of quitting" to improve the performance of the 3-class model.
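To illustrate the weak supervision step described in Methods, the following is a minimal sketch, not the authors' implementation. It assumes a handful of hypothetical keyword-based labeling functions over note text and combines their votes by simple majority; the abstract's approach instead fits a generative model over such intermediate labels and then trains a classifier, neither of which is reproduced here. All function names, patterns, and the pack-year regular expression are illustrative assumptions.

```python
import re
from collections import Counter

# Label values (hypothetical encoding, not from the abstract).
ABSTAIN, NEVER, EVER = -1, 0, 1

# Assumed pattern for mentions such as "40 pack-year" or "22.5 pack years".
PACK_YEARS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*pack[\s-]?years?\b", re.I)


def extract_pack_years(note):
    """Return the first pack-year value found in the note, or None."""
    match = PACK_YEARS_RE.search(note)
    return float(match.group(1)) if match else None


def lf_never_phrase(note):
    """Vote NEVER on explicit never-smoker wording."""
    pattern = r"\b(never[- ]smoke[rd]?|non-?smoker|denies (?:any )?smoking)\b"
    return NEVER if re.search(pattern, note, re.I) else ABSTAIN


def lf_ever_phrase(note):
    """Vote EVER on explicit former/current smoker wording."""
    pattern = r"\b(former smoker|current smoker|ex-smoker|quit smoking|smokes)\b"
    return EVER if re.search(pattern, note, re.I) else ABSTAIN


def lf_pack_years(note):
    """Vote EVER when a positive pack-year count is documented."""
    pack_years = extract_pack_years(note)
    return EVER if pack_years is not None and pack_years > 0 else ABSTAIN


LABELING_FUNCTIONS = [lf_never_phrase, lf_ever_phrase, lf_pack_years]


def weak_label(note, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain on ties or no votes.

    Majority vote is a stand-in for the generative label model named in the
    abstract, which would weight each function by its estimated accuracy.
    """
    votes = [v for v in (lf(note) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    top = Counter(votes).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]


if __name__ == "__main__":
    note = "Patient is a former smoker with a 40 pack-year history."
    print(weak_label(note), extract_pack_years(note))
```

In a pipeline of the kind the abstract outlines, notes where the functions abstain or disagree would be left to the downstream classifier, and a manually curated subset (564 patients in the abstract) would be held out to measure per-class precision and recall.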