Abstract

Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength), and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation.

Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was greater than 2, even when the spiked samples represented only 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets.

Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.

Highlights

  • Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models

  • To assess the effects of sample size and signature strength on spiked-in gene recovery, we first examined the spiked probe set recovery rates as a function of signature strength and the number of spiked samples

  • Similar trends were seen when 50 or 500 probe sets were spiked (Supplementary Results). These observations were consistent across all 3 data sets and indicate that the fold increase in the expression value of informative probe sets has a major influence on the feature recovery rate


Summary

Introduction

Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. Gene expression data are commonly used to develop multi-gene prediction models for various clinical classification problems. Model development typically starts from genes (i.e., features) that are differentially expressed between the groups of interest. These informative features are then used as variables to train a multivariate classification model. The predictive performance of a classifier must depend on the number of informative features, the magnitude of the difference in feature expression levels between the groups of interest, and the number of informative cases in each group. These critical parameters are expected to vary from classification problem to classification problem and from data set to data set. It is not well understood how each of these components influences the success of the classifier development process, or what the minimum requirements for developing a successful predictor might be.
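
To make the three parameters concrete (signature size, signature strength, and number of perturbed samples), the following is a minimal sketch of a spike-in experiment evaluated by Monte-Carlo cross validation. It is an illustration under stated assumptions, not the authors' pipeline: the baseline matrix is synthetic rather than one of the real data sets, and the feature ranking (a Welch-style t statistic), the nearest-centroid classifier, the balanced-accuracy metric, and all function names (`spike_signature`, `top_features`, `monte_carlo_cv`) are hypothetical choices made for this example.

```python
import numpy as np

def spike_signature(expr, n_probes, fold, frac_samples, rng):
    """Multiply n_probes random probe sets by `fold` in a random subset of samples.

    expr is a samples x probe-sets matrix on the linear scale.
    Returns the perturbed copy, 0/1 sample labels, and the spiked probe indices.
    """
    n_samples, n_features = expr.shape
    n_spiked = max(1, int(round(frac_samples * n_samples)))
    samples = rng.choice(n_samples, size=n_spiked, replace=False)
    probes = rng.choice(n_features, size=n_probes, replace=False)
    spiked = expr.copy()
    spiked[np.ix_(samples, probes)] *= fold
    labels = np.zeros(n_samples, dtype=int)
    labels[samples] = 1
    return spiked, labels, probes

def top_features(X, y, k):
    """Rank features by absolute Welch-style t statistic (assumed selection rule)."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(-np.abs(t))[:k]

def monte_carlo_cv(X, y, k=10, n_splits=50, test_frac=0.3, rng=None):
    """Mean balanced accuracy over repeated random stratified train/test splits."""
    if rng is None:
        rng = np.random.default_rng()
    scores = []
    for _ in range(n_splits):
        # Draw a fresh stratified test set on every iteration.
        test_idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c),
                       size=max(1, int(test_frac * np.sum(y == c))),
                       replace=False)
            for c in (0, 1)
        ])
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        Xtr, ytr, Xte, yte = X[train], y[train], X[test_idx], y[test_idx]
        # Feature selection uses training data only, to avoid leakage.
        feats = top_features(Xtr, ytr, k)
        c1 = Xtr[ytr == 1][:, feats].mean(axis=0)
        c0 = Xtr[ytr == 0][:, feats].mean(axis=0)
        d1 = ((Xte[:, feats] - c1) ** 2).sum(axis=1)
        d0 = ((Xte[:, feats] - c0) ** 2).sum(axis=1)
        pred = (d1 < d0).astype(int)
        sensitivity = np.mean(pred[yte == 1] == 1)
        specificity = np.mean(pred[yte == 0] == 0)
        scores.append((sensitivity + specificity) / 2)
    return float(np.mean(scores))

# Synthetic stand-in for an expression matrix: 100 samples x 500 probe sets.
rng = np.random.default_rng(0)
expr = 2.0 ** rng.normal(8.0, 1.0, size=(100, 500))

# Spike 10 probe sets with a 4-fold increase in 10% of the samples.
spiked, labels, probes = spike_signature(expr, n_probes=10, fold=4.0,
                                         frac_samples=0.10, rng=rng)
score = monte_carlo_cv(np.log2(spiked), labels, k=10, rng=rng)
print(f"mean balanced accuracy: {score:.3f}")
```

Varying `n_probes`, `fold`, and `frac_samples` in this sketch mirrors the three axes described above. Balanced accuracy is used here because plain accuracy is uninformative when only 10% of samples are perturbed.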
