Abstract

Many previous studies have shown that by using variants of “guilt-by-association”, gene function predictions can be made with very high statistical confidence. In these studies, it is assumed that the “associations” in the data (e.g., protein interaction partners) of a gene are necessary in establishing “guilt”. In this paper we show that multifunctionality, rather than association, is a primary driver of gene function prediction. We first show that knowledge of the degree of multifunctionality alone can produce astonishingly strong performance when used as a predictor of gene function. We then demonstrate how multifunctionality is encoded in gene interaction data (such as protein interactions and coexpression networks) and how this can feed forward into gene function prediction algorithms. We find that high-quality gene function predictions can be made using data that possesses no information on which gene interacts with which. By examining a wide range of networks from mouse, human and yeast, as well as multiple prediction methods and evaluation metrics, we provide evidence that this problem is pervasive and does not reflect the failings of any particular algorithm or data type. We propose computational controls that can be used to provide more meaningful control when estimating gene function prediction performance. We suggest that this source of bias due to multifunctionality is important to control for, with widespread implications for the interpretation of genomics studies.

Highlights

  • Understanding the function of genes is one of the central challenges of biology [1,2,3]

  • We show that node degree underlies a large fraction of the performance of gene function prediction methods

  • As with the simple node degree ranking, we found that Individual Property Networks (IPNs) perform very well in gene function prediction tasks compared to results from the original association matrix

Introduction

Understanding the function of genes is one of the central challenges of biology [1,2,3]. The same gene may have different functions depending on context, which is in turn defined partly by the presence of other gene products. The tumor suppressor TP53 has different functions depending on its interaction partners. While we define “multifunctionality” precisely below, we intend the term to mean approximately “the number of functions a gene is involved in”. We take a close look at how the degree of multifunctionality (whether it is known or not) interacts with the computational assignment of functions to genes. This seemingly esoteric issue turns out to have surprisingly deep implications for how high-throughput data sets are interpreted.
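To make the approximate definition concrete, the following minimal sketch scores each gene by the number of function terms it is annotated with and then asks how well that single ranking separates the members of each term from non-members, measured by AUROC. It is an illustration under stated assumptions, not the authors' procedure: the gene names, term names, and the helper functions `multifunctionality` and `auroc` are hypothetical, the simple annotation count stands in for the precise definition the paper gives later, and the toy annotations are invented for the example.

```python
from typing import Dict, List, Set


def multifunctionality(annotations: Dict[str, Set[str]],
                       genes: List[str]) -> Dict[str, int]:
    """Count how many function terms each gene is annotated with.

    Assumption: this plain count approximates "the number of functions
    a gene is involved in"; it is not the paper's precise definition.
    """
    counts = {gene: 0 for gene in genes}
    for members in annotations.values():
        for gene in members:
            if gene in counts:
                counts[gene] += 1
    return counts


def auroc(scores: Dict[str, float], positives: Set[str]) -> float:
    """AUROC as the fraction of (positive, negative) gene pairs in which
    the positive gene has the higher score; ties count as one half."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy annotations, invented for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
annotations = {
    "term_A": {"g1", "g2", "g3"},
    "term_B": {"g1", "g2"},
    "term_C": {"g1", "g4"},
}

# One ranking, reused unchanged for every term. Note that the count
# includes the term being evaluated, i.e. multifunctionality is "known".
mf = multifunctionality(annotations, genes)
for term, members in annotations.items():
    print(term, round(auroc(mf, members), 3))
```

The point of the sketch is that the ranking never changes from term to term and uses no information about which gene is associated with which, yet it can still score well whenever heavily annotated genes dominate the positive sets.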

