Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?

Haleh Yasrebi, Viviane Praz, Peter Sperisen, Philipp Bucher, Jörg Hoheisel

doi:10.1371/journal.pone.0007431

Open Access

Abstract

Background: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies in clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, prognosis, and prediction of treatment response. However, the limited number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore not at all clear whether the advantage of increased sample size outweighs the disadvantage of higher heterogeneity of merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.

Results: Using time-dependent Receiver Operating Characteristic Area Under the Curve (ROC-AUC) and hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets as compared to individual data sets. This was apparently because a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1.

Conclusions: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used, (b) the heterogeneity of patient cohorts, (c) the heterogeneity of breast cancer disease, (d) substantial variation of time to death or relapse, and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors like CYB5D1 expression.
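
To make the notions of a single-gene Cox predictor and its hazard ratio concrete, the following is a minimal sketch, not the authors' pipeline. It fits a Cox model with one covariate by Newton's method on the partial likelihood, assuming no tied event times. The function names and the eight-patient toy data are invented for illustration and are not CYB5D1 measurements.

```python
import math

def cox_terms(beta, times, events, x):
    """Log partial likelihood, its gradient and its curvature for one covariate.

    Assumes no tied event times. Censored patients (event == 0) contribute
    only through the risk sets of patients with earlier events.
    """
    loglik = grad = curv = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(wj * x[j] for wj, j in zip(w, risk))
        s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
        loglik += beta * x[i] - math.log(s0)
        grad += x[i] - s1 / s0
        curv += s2 / s0 - (s1 / s0) ** 2
    return loglik, grad, curv

def fit_single_gene_cox(times, events, x, tol=1e-8, max_iter=50):
    """Return (beta, hazard ratio per unit of expression)."""
    beta = 0.0
    loglik, grad, curv = cox_terms(beta, times, events, x)
    for _ in range(max_iter):
        if abs(grad) < tol:
            break
        step = grad / curv
        # Halve the Newton step while it would clearly lower the likelihood.
        while cox_terms(beta + step, times, events, x)[0] < loglik - 1e-12:
            step /= 2
        beta += step
        loglik, grad, curv = cox_terms(beta, times, events, x)
    return beta, math.exp(beta)

# Toy cohort (invented): follow-up time, event flag (1 = death or relapse),
# and standardized expression of one hypothetical gene.
times = [2, 3, 5, 7, 8, 11, 12, 15]
events = [1, 1, 1, 0, 1, 1, 0, 1]
expression = [1.8, 0.9, 1.2, 0.3, -0.2, 0.4, -1.0, -0.9]

beta, hazard_ratio = fit_single_gene_cox(times, events, expression)
print(f"beta = {beta:.3f}, hazard ratio = {hazard_ratio:.3f}")
```

In this toy cohort, patients with higher expression tend to have earlier events, so the fitted coefficient is positive and the hazard ratio exceeds 1. A real analysis would use an established survival library and handle tied event times.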

Highlights

Microarray gene expression data have been integrated to increase statistical power.
Verification of data integration: to assess the removal of microarray bias across data sets, Principal Component Analysis (PCA) and hierarchical clustering were applied to the data sets after the data integration methods had been applied.
For the verification of data integration by PCA, the merged data sets were projected onto the planes defined by the first two principal components (PCs).
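
The PCA check described above can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's code: the two toy data sets are invented, per-data-set mean-centering stands in for whichever integration methods the study applied, and the principal components are found by simple power iteration, so only the standard library is needed.

```python
import math

def center_columns(rows):
    """Subtract each gene's (column's) mean from every sample (row)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of column-centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(p)]
            for a in range(p)]

def leading_eigenvector(matrix, iters=500):
    """Power iteration; returns (eigenvalue, unit eigenvector)."""
    p = len(matrix)
    v = [1.0 / (k + 1) for k in range(p)]  # arbitrary non-uniform start
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    value = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(p))
                for a in range(p))
    return value, v

def first_two_pc_scores(rows):
    """Project samples onto the first two principal components."""
    centered = center_columns(rows)
    cov = covariance(centered)
    val1, pc1 = leading_eigenvector(cov)
    p = len(cov)
    # Deflate: remove the first component, then find the next one.
    deflated = [[cov[a][b] - val1 * pc1[a] * pc1[b] for b in range(p)]
                for a in range(p)]
    _, pc2 = leading_eigenvector(deflated)
    return [(sum(x * a for x, a in zip(row, pc1)),
             sum(x * a for x, a in zip(row, pc2))) for row in centered]

# Toy data (invented): rows are samples, columns are four shared genes.
# study_b carries a platform offset of roughly +3 on every gene.
study_a = [[1.0, 2.0, 0.5, 1.5], [1.2, 1.8, 0.7, 1.4], [0.8, 2.2, 0.4, 1.6]]
study_b = [[4.1, 4.9, 3.6, 4.4], [3.9, 5.2, 3.4, 4.6], [4.2, 5.0, 3.5, 4.5]]

before = first_two_pc_scores(study_a + study_b)
after = first_two_pc_scores(center_columns(study_a) + center_columns(study_b))

for label, scores in (("uncorrected", before), ("mean-centered", after)):
    print(label, [(round(x, 2), round(y, 2)) for x, y in scores])
```

Before correction, the first principal component mostly separates the two studies, which is the signature of platform bias. After per-study centering, the study means coincide on both components, so any remaining spread reflects variation between samples and not between studies.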

Summary

Introduction

Sample size is a bottleneck in DNA microarray-based gene expression studies, because microarray experiments are time-consuming, expensive and noisy, and are limited by the number of available biological samples (cancer patients). To circumvent this problem, microarray gene expression data sets addressing the same or similar biological questions have been analyzed jointly, either by so-called meta-analysis [1,2,3,4,5], which means integration at the level of results derived separately from individual data sets, or by data merging [6,7,8,9,10,11,12,13,14,15,16]. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.
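
As a rough illustration of data merging, as opposed to meta-analysis, the sketch below pools samples from two studies after restricting them to the genes measured in both. The study names, gene names and values are hypothetical, and removal of platform bias is deliberately left out as a separate step.

```python
def merge_on_common_genes(datasets):
    """Pool samples across studies, keeping only genes present in every study.

    datasets: {study_name: {gene_name: [expression value per sample]}}
    Returns (sorted common genes, merged {gene: values}, study label per sample).
    """
    common = set.intersection(*(set(genes) for genes in datasets.values()))
    genes = sorted(common)
    merged = {gene: [] for gene in genes}
    labels = []
    for name, data in datasets.items():
        n_samples = len(next(iter(data.values())))
        labels.extend([name] * n_samples)
        for gene in genes:
            merged[gene].extend(data[gene])
    return genes, merged, labels

# Hypothetical studies run on different platforms with different gene coverage.
datasets = {
    "study_a": {"GENE_A": [5.1, 4.8], "GENE_B": [2.0, 2.3], "GENE_C": [9.9, 9.1]},
    "study_b": {"GENE_A": [7.2, 6.9, 7.0], "GENE_C": [3.3, 3.1, 3.6],
                "GENE_D": [1.1, 1.4, 1.2]},
}

genes, merged, labels = merge_on_common_genes(datasets)
print(genes)    # genes retained after merging
print(labels)   # which study each pooled sample came from
```

GENE_B and GENE_D are each missing from one study, so they drop out of the merged data set. This is the mechanism by which a gene with strong prognostic power can be lost when platforms differ in coverage.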

Journal: PLoS ONE
Publication Date: Oct 23, 2009
Citations: 95
License type: CC BY 4.0
