Abstract

Background: Accurate modelling of the impact of patient-specific features and cancer treatments on survival allows the assignment of targeted therapy. There has not been any effort to build a multi-source model for the survival analysis of breast cancer. In this study we present a prognostic model that integrates genetic (DNA), clinical and therapy inputs to predict survival in early breast cancer (stages 1-3) across all breast cancer subtypes.

Methods: We used a data-driven Random Survival Forest approach, a non-parametric statistical ensemble learning method that incorporates censoring and time-to-event data. Learning is performed by building numerous decision trees and selecting the model according to how well it predicts outcomes in unseen data. We used The Cancer Genome Atlas (TCGA) breast cancer dataset and observed that accuracy improved as more data sources were added, in line with previous research. The impact of integrating non-silent somatic tumor mutations (whole exome) and gene copy number variation (CNV) was analyzed across all mutations and per individual mutation.

Results: Data from 1096 women with stage 1-3 early breast cancer were inputs to the model: n=437 ER+ve HER2-ve, n=123 HER2+ve ER+ve, n=40 HER2+ve ER-ve and n=126 TNBC. By pathological stage: stage 1, n=183; stage 2, n=620; and stage 3, n=249. The following chemotherapy and hormonal treatments were included in the analysis: anthracycline, taxanes, platinum, alkylating and anti-metabolite agents, anti-oestrogen, aromatase inhibitors, ovarian suppression and HER2 antibody treatment. Using only clinical data, the model predicted survival in early breast cancer with an area under the curve (AUC) and c-index of 0.78. Adding hormone, genetic and treatment data improved predictive accuracy stepwise, to an AUC of 0.86 and a c-index of 0.85. We observed the same trend when the proportion of test data was increased from 0.25 to 0.75. Changes in the median copy number of the genes FGFR2 and CDKN2A were strongly prognostic (p=0.0001 and p=0.002, respectively), with weaker signals for CBFB (p=0.05), HRAS (p=0.06) and AKT (p=0.07).

Conclusion: Using public datasets and multi-source, patient-specific data, we developed a model that predicts survival for an individual with early breast cancer up to 5 years from diagnosis. This approach to survival analysis yields good accuracy (AUC 0.86, c-index 0.85).

Citation Format: Aidan (Amanzhol) Kubeyev, Andrea Giorni, Prabu Siva, Luiz Silva, Jordan Laurie, Matthew Foster, Matthew Griffiths, Uzma Asghar. A prognostic machine learning model for early breast cancer which combines clinical and genetic data in patients treated with neo/adjuvant chemotherapy [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5696.
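
The c-index quoted in the Results measures how often a model ranks two patients in the right order when censoring is taken into account. The abstract does not say how the metric was computed, so the following is only a minimal sketch of Harrell's concordance index in plain Python. The function name `concordance_index` and the toy follow-up data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[bool],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's c-index for right-censored survival data (illustrative sketch).

    A pair (i, j) is comparable when patient i had an observed event and
    patient j was still under follow-up after that time. The pair is
    concordant when the patient with the earlier event has the higher
    predicted risk; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored patient cannot be the earlier member of a pair,
            # because the true event time is unknown.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; c-index is undefined")
    return concordant / comparable


# Hypothetical toy data: follow-up time in years, event flag (True = death
# observed, False = censored), and a model's predicted risk score.
toy_times = [2, 4, 5, 7, 9]
toy_events = [True, True, False, True, False]
toy_risk = [0.9, 0.4, 0.6, 0.5, 0.1]

print(concordance_index(toy_times, toy_events, toy_risk))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported increase from 0.78 to 0.85. Pairs with tied event times are simply skipped in this sketch; library implementations may treat them differently.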