Machine learning and natural language processing (NLP) approach to predict early progression to first-line treatment in real-world hormone receptor-positive (HR+)/HER2-negative advanced breast cancer patients.

Nuria Ribelles, Leo Franco, Ester Villar, Bella Pajares, Ana Godoy, Emilio Alba, Maria Bermejo, Pablo Rodríguez-Brazzarola, Tamara Díaz-Redondo, Alfonso Sánchez-Muñoz, Héctor Mesa, José M Jerez, Laura Gálvez, F Carabantes, Begoña Jiménez, Sofía Ruiz-Medina, Antonia Márquez, E Sáez, Irene López, Maria Emilia Domínguez-Recio

doi:10.1016/j.ejca.2020.11.030

Abstract

CDK4/6 inhibitors plus endocrine therapy are the current standard of care in the first-line treatment of HR+/HER2-negative metastatic breast cancer, but there are no well-established clinical or molecular factors that predict patient response. In the era of personalised oncology, new approaches for developing predictive models of response are needed. Data derived from the electronic health records (EHRs) of real-world patients with HR+/HER2-negative advanced breast cancer were used to develop predictive models for early and late progression to first-line treatment. Two machine learning approaches were used: a classic approach based on a data set of features manually extracted from the reviewed EHRs, and a second approach based on natural language processing (NLP) of the free-text clinical notes recorded during medical visits. Of the 610 patients included, 473 (77.5%) progressed to first-line treatment, and in 126 (20.6% of all patients) the progression occurred within the first 6 months. A further 152 patients (24.9%) showed no disease progression before 28 months from the onset of first-line treatment. For early progression, the best predictive model built on the manually extracted data set achieved an area under the curve (AUC) of 0.734 (95% CI 0.687-0.782), while the best model built with NLP free-text processing obtained an AUC of 0.758 (95% CI 0.714-0.800). For long responders, the best model built on manually extracted data obtained an AUC of 0.669 (95% CI 0.608-0.730), while the best model built with NLP free-text processing attained an AUC of 0.752 (95% CI 0.705-0.799). Using machine learning methods, we developed predictive models for early and late progression to first-line treatment of HR+/HER2-negative metastatic breast cancer, and found that NLP-based models performed slightly better than models based on manually extracted data.
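
The abstract reports each model as an AUC with a 95% confidence interval. As an illustration of how such a figure can be produced, the following is a minimal sketch that computes the AUC from predicted scores and attaches a percentile bootstrap interval. The abstract does not say how the intervals were obtained, so the bootstrap is an assumption here, and the function names (`auc`, `bootstrap_ci`) and the toy labels and scores are hypothetical rather than taken from the study.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Equivalent to the Mann-Whitney U statistic; tied scores count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class, so AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]


# Toy example: 1 = early progression, 0 = no early progression (made-up values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
toy_low, toy_high = bootstrap_ci(toy_labels, toy_scores)
print(f"AUC {toy_auc:.3f} (95% CI {toy_low:.3f}-{toy_high:.3f})")
```

With only four toy cases the interval is very wide; on a cohort of several hundred patients the same procedure gives intervals far narrower than this toy example.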
