Abstract

Microarrays have been useful in understanding various biological processes by allowing the simultaneous study of the expression of thousands of genes. However, the analysis of microarray data is a challenging task. One of the key problems in microarray analysis is the classification of unknown expression profiles. Specifically, the often large number of non-informative genes on the microarray adversely affects the performance and efficiency of classification algorithms. Furthermore, the skewed ratio of samples to variables poses a risk of overfitting. In this context, feature selection methods become crucial for selecting relevant genes and, hence, improving classification accuracy. In this study, we investigated feature selection methods based on gene expression profiles and protein interactions. We found that, in our setup, adding protein interaction information did not significantly improve the classification results. Furthermore, we developed a novel feature selection method that relies exclusively on observed gene expression changes in microarray experiments, which we call “relative Signal-to-Noise ratio” (rSNR). More precisely, the rSNR ranks genes based on their specificity to an experimental condition by comparing intrinsic variation, i.e. variation in gene expression within an experimental condition, with extrinsic variation, i.e. variation in gene expression across experimental conditions. Genes with low variation within an experimental condition of interest and high variation across experimental conditions are ranked higher and help to improve classification accuracy. We compared different feature selection methods on two time-series microarray datasets and one static microarray dataset. We found that the rSNR generally performed better than the other methods.
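The ranking idea behind the rSNR can be illustrated with a short sketch. The abstract does not give the exact formula, so the score below is an assumption, not the authors' definition: it divides the spread of per-condition mean expression (extrinsic variation) by the spread within the condition of interest (intrinsic variation). The function names `rsnr_like_scores` and `top_genes`, and the small constant `eps`, are hypothetical.

```python
import numpy as np

def rsnr_like_scores(expr, labels, target, eps=1e-12):
    """Score genes by extrinsic / intrinsic variation (assumed form, not the published rSNR).

    expr:   array of shape (n_samples, n_genes)
    labels: experimental condition of each sample, shape (n_samples,)
    target: the experimental condition of interest
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)

    # Intrinsic variation: spread of each gene within the condition of interest.
    intrinsic = expr[labels == target].std(axis=0)

    # Extrinsic variation: spread of the per-condition means across all conditions.
    condition_means = np.stack(
        [expr[labels == c].mean(axis=0) for c in np.unique(labels)]
    )
    extrinsic = condition_means.std(axis=0)

    # eps avoids division by zero for genes that are constant within the target.
    return extrinsic / (intrinsic + eps)

def top_genes(expr, labels, target, k):
    """Return the indices of the k highest-scoring genes."""
    scores = rsnr_like_scores(expr, labels, target)
    return np.argsort(scores)[::-1][:k]

# Toy data: gene 0 is stable within each condition but differs across conditions,
# gene 1 is noisy within conditions and has the same mean in both.
expr = [[1.0, 5.0], [1.1, 1.0], [5.0, 4.0], [5.1, 2.0]]
labels = ["a", "a", "b", "b"]
selected = top_genes(expr, labels, target="a", k=1)
```

In this toy case gene 0 is ranked first, which matches the stated intent: low variation within the condition of interest and high variation across conditions.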

Highlights

  • DNA microarrays can be classified into static experiments, where a snapshot of gene expression in different samples is measured, and time series experiments, where a temporal process is measured over a period

  • We evaluated how gene expression profiles from two different time points can be combined, yielding transition profiles

  • We first showed that the nearest neighbor method performs comparably to the method developed by Hafemeister et al.

  • We found that, for the purpose of classification, mean expression profiles describe time-series transitions based on microarray experiments better than differential expression profiles and single time point expression profiles
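The two kinds of transition profile named in the highlights, and a nearest neighbor lookup over them, can be sketched as follows. This is a minimal illustration under stated assumptions: the highlights do not specify how profiles are combined or which distance is used, so the element-wise mean, the end-minus-start difference, and the Euclidean distance are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def mean_profile(start, end):
    """Combine two time points into a mean expression profile (assumed: element-wise mean)."""
    return (np.asarray(start, dtype=float) + np.asarray(end, dtype=float)) / 2.0

def differential_profile(start, end):
    """Combine two time points into a differential profile (assumed: end minus start)."""
    return np.asarray(end, dtype=float) - np.asarray(start, dtype=float)

def nearest_neighbor(query, references, labels):
    """Return the label of the reference profile closest to the query.

    Euclidean distance is an assumption; the source does not name a metric.
    """
    references = np.asarray(references, dtype=float)
    distances = np.linalg.norm(references - np.asarray(query, dtype=float), axis=1)
    return labels[int(np.argmin(distances))]

# Toy example with two genes measured at two time points.
t0, t1 = [1.0, 3.0], [3.0, 5.0]
query = mean_profile(t0, t1)
reference_profiles = [[2.0, 4.1], [10.0, 10.0]]
reference_labels = ["transition_x", "transition_y"]
predicted = nearest_neighbor(query, reference_profiles, reference_labels)
```

Swapping `mean_profile` for `differential_profile` when building both the query and the references gives the alternative representation that the highlights compare against.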


Summary

Introduction

DNA microarrays can be classified into static experiments, where a snapshot of gene expression in different samples is measured, and time series experiments, where a temporal process is measured over a period. An interesting problem in microarray analysis is the classification of unknown expression profiles with the goal of assigning them to one or more predefined classes. Such classes represent various phenotypes, for example, diseases. Classifying microarray data by cross-comparing microarray data from different laboratories and phenotypes could help not only to identify unknown samples but also to reveal obscure associations between complex phenotypes, such as shared pathogenic pathways among different diseases. Such approaches have become more feasible in recent years with the availability of large database repositories of high-throughput gene expression data, such as the Gene Expression Omnibus (GEO) [2]. Reducing the number of genes using feature selection methods results in more efficient use of computational resources and a lower risk of overfitting, and enables a better biological understanding of the data.

