Abstract

Background: High-throughput microarray experiments now permit researchers to screen thousands of genes simultaneously and determine the different expression levels of genes in normal or cancerous tissues. In this paper, we address the challenge of selecting a relevant and manageable subset of genes from a large microarray dataset. Currently, most gene selection methods focus on identifying a set of genes that can further improve classification accuracy. Few, if any, of the genes in these small sets, however, are biologically relevant (i.e., supported by medical evidence). To deal with this critical issue, we propose two novel methods that can identify biologically relevant genes concerning cancers.

Results: The two proposed techniques are random forest gene selection (RFGS) and the support vector sampling technique (SVST). Compared with results from six other methods developed in this paper, we demonstrate experimentally that RFGS and SVST can identify more biologically relevant genes in patients with leukemia or prostate cancer. Among the top 25 genes selected using the SVST method, 15 were biologically relevant in patients with leukemia and 13 were biologically relevant in patients with prostate cancer. The RFGS method, while less effective than SVST, still identified an average of 9 biologically relevant genes in both leukemia and prostate cancers. In contrast to traditional statistical methods, which identify fewer than 8 biologically relevant genes in patients with leukemia and fewer than 8 in patients with prostate cancer, our methods yield significantly better results.

Conclusions: Our proposed SVST and RFGS methods are novel approaches that can identify a greater number of biologically relevant genes. These methods have been successfully applied to both leukemia and prostate cancers. Research in the fields of biology and medicine should benefit from the identification of biologically relevant genes, which can confirm recent discoveries in cancer research or suggest new avenues for exploration.

Highlights

  • High-throughput microarray experiments permit researchers to screen thousands of genes simultaneously and determine the different expression levels of genes in normal or cancerous tissues

  • In addition to six statistical methods, we propose two machine learning-based gene selection methods: Random Forest Gene Selection (RFGS) and Support Vector Sampling Technique (SVST)

  • Among these 8 methods, Weighted Punishment on Overlap (WEPO) finds the fewest biologically relevant genes (5), while Threshold Number of Misclassification (TNoM) identifies 6


Summary

Introduction

High-throughput microarray experiments permit researchers to screen thousands of genes simultaneously and determine the different expression levels of genes in normal or cancerous tissues. Several statistical and machine learning techniques have been applied to selecting informative genes, including the t-test, k-nearest neighbors, clustering methods [8], self-organizing maps (SOM) [9], genetic algorithms [10], backpropagation neural networks [11,12,13], probabilistic neural networks, decision trees [14], random forests [15], and support vector machines (SVM) [16,17]. Although these methods can select a smaller set of informative genes, only a small percentage of these so-called "informative" genes are biologically relevant, as confirmed by medical experiments. We present a novel approach that addresses several considerations: (1) the identification of quality samples, (2) the selection of a small set of informative genes from these samples, (3) the comparison of these genes with medical literature, and (4) the interpretation of their biological relevance.
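As a rough illustration of a conventional statistical baseline and of step (3), the sketch below ranks genes by the absolute value of Welch's t-statistic between two classes and then counts how many of the top-ranked genes appear in a curated list. It is not the RFGS or SVST procedure, whose details are not given in this excerpt. The function names (`welch_t`, `rank_genes`, `count_relevant`), the gene names, the expression values, and the curated set are all hypothetical and chosen only for demonstration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances).

    Each sample needs at least two values so the sample variance is defined.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(expr, labels, top_k):
    """Return the top_k gene names ordered by |t| between the two classes.

    expr   : dict mapping gene name -> list of expression values, one per sample
    labels : list of 0/1 class labels aligned with the sample order
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def count_relevant(selected, curated):
    """Count selected genes that also appear in a curated (literature-backed) set."""
    return sum(1 for gene in selected if gene in curated)


# Toy data (hypothetical): three genes, six samples, first three are class 0.
expr = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],  # strongly differential
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no class difference
    "gene_c": [3.0, 3.5, 2.5, 4.0, 4.5, 3.5],  # weakly differential
}
labels = [0, 0, 0, 1, 1, 1]
curated = {"gene_a"}  # hypothetical stand-in for genes supported by medical literature

top = rank_genes(expr, labels, top_k=2)
print(top, count_relevant(top, curated))
```

The count returned by `count_relevant` corresponds to the kind of figure quoted in the abstract (for example, 15 of the top 25 genes), although in practice the curated set would come from a manual review of the medical literature rather than a hard-coded set.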

