Abstract

Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning. It has been suggested that sample pooling in this context has negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method, in order to determine which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes, bivariate linear patterns, and the combination of both. Our results show a clear increase in prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernel in two of three simulated scenarios. The proposed simulation approach can be used to systematically investigate a number of additional scenarios of practical interest.
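
The kind of simulation described above can be sketched in a few lines of Python. The sketch below is only an illustration of the setup (two groups, a handful of differentially expressed genes, pooling of the training samples, prediction error estimated on unpooled test samples); every concrete choice in it, including the number of genes, the effect size, the pool size, and the use of scikit-learn classifiers rather than the paper's PPLS-DA implementation, is an assumption and not the authors' code.

    # Hedged sketch (not the authors' code): simulate two-group expression data with a
    # few differentially expressed genes, pool the training samples by averaging, and
    # compare classifiers on unpooled test samples. All sizes and effect strengths are
    # illustrative assumptions.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def simulate_group(n_samples, n_genes=1000, n_diff=20, shift=1.0, shifted=False):
        """Expression profiles; the first n_diff genes are shifted in the second group."""
        x = rng.normal(0.0, 1.0, size=(n_samples, n_genes))
        if shifted:
            x[:, :n_diff] += shift
        return x

    def pool_samples(x, pool_size):
        """Average consecutive samples within a group as a crude stand-in for pooling."""
        n_pools = x.shape[0] // pool_size
        return x[: n_pools * pool_size].reshape(n_pools, pool_size, -1).mean(axis=1)

    pool_size = 2                                      # assumed pool size
    x_train = np.vstack([pool_samples(simulate_group(40), pool_size),
                         pool_samples(simulate_group(40, shifted=True), pool_size)])
    y_train = np.r_[np.zeros(40 // pool_size), np.ones(40 // pool_size)]

    x_test = np.vstack([simulate_group(100), simulate_group(100, shifted=True)])
    y_test = np.r_[np.zeros(100), np.ones(100)]

    for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                      ("SVM (linear)", SVC(kernel="linear")),
                      ("SVM (radial)", SVC(kernel="rbf")),
                      ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
        error = 1.0 - clf.fit(x_train, y_train).score(x_test, y_test)
        print(f"{name}: prediction error = {error:.2f}")

Repeating such a run over increasing pool sizes, and over the three pattern scenarios mentioned above, is the kind of systematic comparison the abstract refers to.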

Highlights

  • Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning

  • The prediction error of linear discriminant analysis (LDA) almost doubles with each step up in pool size

  • The results of the two support vector machine methods are similar, with prediction errors between 0.23 and 0.38

Introduction

Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning. Given high individual variances in feature levels (e.g., gene expression), combining several single biomarkers into a so-called biomarker signature (or biomarker panel) has been proposed to help [4]. Such a signature is more robust with respect to outlier results than a single marker. Classification tools have been developed for analysing multiple features to reveal the optimal biosignature for discrimination, for example discriminant analysis, partial least squares approaches, support vector machines, or random forests [2, 4, 5]. Many of these methods can cope both with massive collinearity of features and with the "megavariate" nature of these datasets, where the number of variables is typically much larger than the number of samples.
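
To make the "megavariate" point concrete, the following hedged sketch builds a simple PLS-based discriminant rule on simulated data with far more genes than samples. PLS regression on a 0/1-coded class label is a common way of setting up PLS-DA; the sample size, gene count, effect size, and the choice of two latent components are illustrative assumptions, not settings from the paper.

    # Hedged sketch of a PLS-based discriminant step on "megavariate" data: many
    # collinear genes, few samples. PLS regression on a 0/1-coded class label is a
    # common way to set up PLS-DA; all sizes here are assumptions for illustration.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    n_per_group, n_genes, n_diff = 15, 2000, 30        # p >> n scenario
    X = rng.normal(size=(2 * n_per_group, n_genes))
    X[n_per_group:, :n_diff] += 0.8                    # group 2 shifted in 30 genes
    y = np.r_[np.zeros(n_per_group), np.ones(n_per_group)]

    pls = PLSRegression(n_components=2).fit(X, y)
    scores = pls.transform(X)                          # latent components summarising the signature
    pred = (pls.predict(X).ravel() > 0.5).astype(int)  # simple 0.5 cutoff separates the two groups
    print("apparent (training) error:", np.mean(pred != y))

The few latent components absorb the collinearity among the genes, which is why latent-variable approaches such as partial least squares remain usable when the number of variables far exceeds the number of samples.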
