Abstract

Background: In general, gene function prediction can be formalized as a classification problem and addressed with machine learning techniques. Usually, both labeled positive and labeled negative samples are needed to train the classifier. For gene function prediction, however, the available information covers only positive samples. In other words, we know which genes have the function of interest, while it is generally unclear which genes do not have the function, i.e. the negative samples. If all genes outside the target functional family are treated as negative samples, a class imbalance problem arises, because only a relatively small number of genes are annotated in each family. Furthermore, the classifier may be degraded by false negatives among the heuristically generated negative samples.

Results: In this paper, we present a new technique, namely Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. With the defined negative samples, it is straightforward to predict the functions of unknown genes. In addition, the AGPS algorithm is able to integrate various kinds of data sources to predict gene functions in a reliable and accurate manner. With one-class and two-class Support Vector Machines as the core learning algorithm, the AGPS algorithm performs well for function prediction on yeast genes.

Conclusion: We proposed a new method for defining negative samples in gene function prediction. Experimental results on yeast genes show that AGPS performs well on both training and test sets. In addition, the overlap between the prediction results and GO annotations on unknown genes also demonstrates the effectiveness of the proposed method.

Highlights

  • Gene function prediction can be formalized as a classification problem and addressed with machine learning techniques

  • Once the parameters and the negative set have been defined, the Annotating Genes with Positive Samples (AGPS) algorithm works as a conventional two-class Support Vector Machine (SVM)

  • With the best parameters determined in the training procedure and all positive samples, PSoL was applied to identify putative positive samples among the unknown genes
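
The idea of defining negative samples from positive-only annotations can be illustrated with a small sketch. This is not the AGPS procedure itself: the source uses one-class and two-class SVMs, whereas the sketch below swaps in a plain distance-to-centroid score as a stand-in for the one-class step. The gene names, feature vectors, and the function `pick_negatives` are all hypothetical and exist only for illustration.

```python
import math

def centroid(points):
    """Mean feature vector of a list of equally sized vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_negatives(positives, unlabeled, n_negatives):
    """Return the names of the unlabeled genes least similar to the positives.

    Assumption: distance to the positive centroid stands in for the
    score of a one-class model trained on the positive samples.
    """
    center = centroid(positives)
    ranked = sorted(
        unlabeled,
        key=lambda name: distance(unlabeled[name], center),
        reverse=True,  # farthest from the positives first
    )
    return ranked[:n_negatives]

# Hypothetical toy data: three annotated (positive) genes and four
# unlabeled genes, each described by two made-up features.
positives = [[0.9, 1.0], [1.1, 0.8], [1.0, 1.2]]
unlabeled = {
    "geneA": [1.0, 0.9],
    "geneB": [5.0, 5.2],
    "geneC": [4.8, 6.0],
    "geneD": [1.3, 1.1],
}

negatives = pick_negatives(positives, unlabeled, 2)
print(negatives)
```

The selected genes would then serve as the negative set for an ordinary two-class classifier, while the remaining unlabeled genes (here the ones close to the positives) stay as candidates for the function. Choosing only the most dissimilar genes, rather than every unannotated gene, is what limits both the class imbalance and the risk of false negatives described in the abstract.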

Summary

Introduction

Gene function prediction can be formalized as a classification problem and addressed with machine learning techniques. Both labeled positive and labeled negative samples are needed to train the classifier. With the rapid advance of high-throughput biotechnologies, such as yeast two-hybrid systems [1], protein complex identification [2,3] and microarray expression profiling [4], a large amount of biological data has been generated. These data are rich sources for deducing and understanding gene functions. With various kinds of high-throughput data available, machine learning techniques, especially Support Vector Machines (SVMs), have been used to predict gene functions and have shown promising results [16,17].
