Abstract

Breast cancer outcome can be predicted using models derived from gene expression data or clinical data. Only a few studies have created a single prediction model using both gene expression and clinical data, and these studies often remain inconclusive about whether integration improves prediction performance. We rigorously compare three integration strategies (early, intermediate, and late integration), as well as classifiers employing no integration (only one data type), using five classifiers of varying complexity. We perform our analysis on a set of 295 breast cancer samples for which gene expression data and an extensive set of clinical parameters are available, as well as on four breast cancer datasets containing 521 samples that we used as independent validation. On the 295 samples, a nearest mean classifier employing a logical OR operation (late integration) on clinical and expression classifiers significantly outperforms all other classifiers. Moreover, regardless of the integration strategy, the nearest mean classifier achieves the best performance. All five classifiers achieve their best performance when integrating clinical and expression data. Repeating the experiments using the 521 samples from the four independent validation datasets also indicated a significant performance improvement when integrating clinical and gene expression data. Whether integration also improves performance on other datasets (e.g. other tumor types) has not been investigated, but seems worth pursuing. Our work suggests that future models for predicting breast cancer outcome should exploit both data types by employing a late OR or intermediate integration strategy based on nearest mean classifiers.
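As an illustration of the late OR strategy named in the abstract, the sketch below trains one nearest mean classifier per data type and combines their predictions with a logical OR. It is a minimal sketch under stated assumptions, not the authors' implementation: the Euclidean distance, the label encoding (1 = poor outcome, 0 = good outcome), the toy one-dimensional features, and the names `NearestMean` and `late_or_predict` are all hypothetical choices made for this example.

```python
import numpy as np


class NearestMean:
    """Assign each sample to the class whose training mean is closest.

    Assumption: Euclidean distance; the source does not state the metric.
    """

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One mean vector per class, stacked in the order of classes_
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every sample (rows) to every class mean (columns)
        dist = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[dist.argmin(axis=1)]


def late_or_predict(clf_clin, clf_expr, X_clin, X_expr):
    """Late OR integration: predict 1 if either classifier predicts 1.

    Assumption: label 1 encodes poor outcome, label 0 good outcome.
    """
    pred_clin = clf_clin.predict(X_clin) == 1
    pred_expr = clf_expr.predict(X_expr) == 1
    return np.logical_or(pred_clin, pred_expr).astype(int)


# Toy data (made up for illustration): four samples, one feature per data type
y = np.array([0, 0, 1, 1])
X_clin = np.array([[0.0], [0.2], [1.0], [0.1]])  # hypothetical clinical feature
X_expr = np.array([[0.0], [0.1], [0.2], [1.0]])  # hypothetical expression feature

clf_clin = NearestMean().fit(X_clin, y)
clf_expr = NearestMean().fit(X_expr, y)

pred_clin = clf_clin.predict(X_clin)
pred_expr = clf_expr.predict(X_expr)
pred_or = late_or_predict(clf_clin, clf_expr, X_clin, X_expr)
```

In this toy setting each single-data-type classifier misses one of the two poor-outcome samples, while the OR combination flags both. A real evaluation would score held-out samples rather than the training data used here.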

Highlights

  • Many predictors of breast cancer outcome have been published

  • This is a clear indication that there is synergy between the two data types, and that the late OR integration strategy provides a way to exploit the synergy

  • Integration results in higher area under the curve (AUC) performance. In the DLCV procedure, we optimized the number of features by minimizing the eFPFN error


Summary

Introduction

Many predictors of breast cancer outcome have been published. These predictors have been derived from gene expression data, such as the 70-gene (Veer et al [1]) and 76-gene (Wang et al [2]) signatures, or from clinical data, such as the Nottingham Prognostic Index (NPI, [3]) and the Adjuvant! Online tools [4]. Stratifications for ER and HER2 have been made using gene expression data rather than clinical data, which could lead to better prognostic value [8]. Most of these studies have employed a set of standard clinical variables, such as ER status, tumor grade, tumor size, etc. Horlings et al (In preparation, [9]) have characterized additional clinical features (e.g. matrix formation, central fibrosis, etc.) for an existing cohort of 295 breast cancer samples [10]. By themselves, these additional clinical variables have independent prognostic power. Whether and how this power can be used to build a better classifier for outcome prediction has not been investigated.

