Abstract

Background: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Conclusion: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

Highlights

  • We have compared the predictive performance of the variable selection approach with: a) random forest without any variable selection; b) three other methods that have shown good performance in reviews of classification methods with microarray data [7,31,32] but that do not include any variable selection; c) three methods that carry out variable selection

  • For the three methods that do not carry out variable selection, Diagonal Linear Discriminant Analysis (DLDA), K nearest neighbor (KNN), and


Summary

Introduction

Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). An arbitrary decision as to the number of genes to retain is made (e.g., keep the 50 best ranked genes and use them with a linear discriminant analysis as in [1,7]; keep the best 150 genes as in [8]). This approach, although it can be appropriate when the only objective is to classify samples, is not the most appropriate if the objective is to obtain the smallest possible sets of genes that will allow good predictive performance.
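To make the univariate ranking approach described above concrete, here is a minimal sketch in Python. It is not the procedure of any cited study: the Welch-style t statistic, the function names (`t_statistic`, `top_k_genes`), and the toy expression values are all assumptions chosen for illustration. Note that the sketch only handles two classes and that `k` is picked by hand, which are the limitations the text points out.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(values, labels):
    """Welch-style t statistic for one gene; labels must be 0 or 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Degenerate case (no within-class variance); this sketch
        # simply scores it as 0 rather than handling it properly.
        return 0.0
    return (mean(b) - mean(a)) / se


def top_k_genes(expression, labels, k):
    """Rank genes one at a time by |t| and keep the k best.

    expression: dict mapping gene name -> list of values, one per sample.
    k: the arbitrary cut-off on the number of genes to retain.
    """
    ranked = sorted(
        expression,
        key=lambda g: abs(t_statistic(expression[g], labels)),
        reverse=True,
    )
    return ranked[:k]


# Toy data invented for illustration: 6 samples, 3 per class.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # strongly separated
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # mostly noise
    "gene_c": [0.5, 0.7, 0.6, 1.0, 1.4, 1.1],  # moderately separated
}

selected = top_k_genes(expression, labels, k=2)
print(selected)
```

Each gene is scored in isolation, so the ranking ignores both the interactions among genes and the classifier that will eventually use them; changing `k` changes the selected set without any built-in criterion for which value is best.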
