Abstract

BackgroundWhile the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross‐validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un‐annotated genes. A total of approximately 5043 different genes, or about one‐third of the predicted genes in the D. melanogaster genome, are represented in the dataset and 1854 (or 37%) of these genes are un‐annotated.Results39 Gene Ontology Biological Process (GO‐BP) categories were found with precision value equal or larger than 0.75, when recall was fixed at the 0.4 level. For two of those categories, we have provided additional support for assigning given genes to the category by showing that the majority of transcripts for the genes belonging in a given category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we have been able to provide a putative GO‐BP term for 1422 previously un‐annotated genes or about 77% of the un‐annotated genes represented on the microarray and about 19% of all of the un‐annotated genes in the D. melanogaster genome.ConclusionsOur study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate a number of predictions for new annotations for D. melanogaster genes. The applied probabilistic analysis to SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.

Highlights

  • While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation

  • Using microarray data from the life cycle of D. melanogaster [22], and a controlled vocabulary for annotation of biological processes associated with D. melanogaster genes from the Gene Ontology Consortium (GO-BP) [23], in this study we propose a method of predicting gene function of un-annotated genes in the D. melanogaster genome by using Support Vector Machines and a two-level data splitting rotation scheme for validation

  • Gene function prediction using Support Vector Machines The prediction of gene functions was performed with the use of a number of SVM classifiers

Read more

Summary

Introduction

While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Functional genomic data and in particular microarray mRNA expression data have been used by numerous researchers as a means to help predict the function of un-annotated genes. The analysis of such data is based on the premise that genes involved in a particular biological, molecular, and/or biochemical process are often co-expressed. A common approach to examine co-expression profiles from microarray experiment data is to use clustering analysis. In this type of analysis, genes are organized and grouped based on their expression profile, with genes having similar expression pattern being “clustered” (grouped) together. Popular clustering algorithms are hierarchical clustering, k-means clustering and Self Organizing Maps [3]

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call