Using natural language processing and machine learning to identify breast cancer local recurrence

Zexian Zeng, Sasa Espino, Xia Jiang, Xiaoyu Li, Yuan Luo, Ankita Roy, Susan E Clare, Richard Neapolitan, Seema A Khan

doi:10.1186/s12859-018-2466-x

Open Access

Abstract

Background: Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. Developing a model using natural language processing and machine learning to identify local recurrences in breast cancer patients can reduce the time-consuming work of a manual chart review.

Methods: We design a novel concept-based filter and a prediction model to detect local recurrences using electronic health records (EHRs). In the training dataset, we manually review a development corpus of 50 progress notes and extract partial sentences that indicate breast cancer local recurrence. We process these partial sentences with MetaMap to obtain a set of Unified Medical Language System (UMLS) concepts, which we call the positive concept set. We apply MetaMap to patients’ progress notes and retain only the concepts that fall within the positive concept set. These features, combined with the number of pathology reports recorded for each patient, are used to train a support vector machine to identify local recurrences.

Results: We compared our model with three baseline classifiers using either full MetaMap concepts, filtered MetaMap concepts, or bag of words. Our model achieved the best AUC (0.93 in cross-validation, 0.87 in held-out testing).

Conclusions: Compared to a labor-intensive chart review, our model provides an automated way to identify breast cancer local recurrences. We expect that by minimally adapting the positive concept set, this study has the potential to be replicated at other institutions with a moderately sized training dataset.
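The filtering step described in the abstract's methods (retain only the concepts that fall within the positive concept set, then add the number of pathology reports as an extra feature) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the concept identifiers are placeholders rather than real UMLS concepts, the function names are hypothetical, concepts are assumed to have been extracted by MetaMap beforehand, and counting occurrences (rather than, say, recording presence or absence) is one possible encoding that the abstract does not specify. The resulting vector is the kind of input a support vector machine could be trained on; the classifier itself is not shown.

```python
from collections import Counter

# Hypothetical positive concept set: concept identifiers that reviewers
# associated with local recurrence. These IDs are placeholders, not real
# UMLS concepts.
POSITIVE_CONCEPTS = ["C0000001", "C0000002", "C0000003"]


def filter_concepts(note_concepts, positive_concepts):
    """Keep only the concepts that belong to the positive concept set."""
    allowed = set(positive_concepts)
    return [c for c in note_concepts if c in allowed]


def build_feature_vector(note_concepts, n_pathology_reports, positive_concepts):
    """Count each positive concept, then append the pathology report count.

    Using counts is an assumption of this sketch, not a detail from the paper.
    """
    counts = Counter(filter_concepts(note_concepts, positive_concepts))
    return [counts[c] for c in positive_concepts] + [n_pathology_reports]


# Concepts as they might come out of MetaMap for one patient's progress notes
# (assumed to be already extracted; MetaMap itself is not called here).
patient_concepts = ["C0000002", "C9999999", "C0000002", "C0000001"]
features = build_feature_vector(patient_concepts, 3, POSITIVE_CONCEPTS)
print(features)  # [1, 2, 0, 3]
```

Because the vector has one entry per positive concept plus one for the pathology report count, its length depends only on the size of the positive concept set, not on the vocabulary of the notes.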

Highlights

Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice
We modeled the task as a classification problem and reported the probability
A total of 17,897, 4150, and 57,612 features were generated for baselines ‘full MetaMap’, ‘filtered MetaMap’, and ‘bag of words’, respectively

Summary

Introduction

Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. To improve breast cancer outcomes, many research groups have focused on developing new treatment strategies [1, 2], identifying new biomarkers [3], and studying related risk factors [4,5,6,7,8]. Carrying out these studies requires a direct and effective outcome measure. The abundant data extracted from electronic health records (EHRs) is an attractive resource for retrospective research, such as low-cost case-control studies. This resource has allowed researchers to conduct large cohort studies to answer various clinical questions. Biopsies and tumors stored in biobanks can be linked to the EHR.

Objectives

Methods

Results

Conclusion