Abstract

The Guilt-by-Association (GBA) principle, according to which genes with similar expression profiles are functionally associated, is widely applied in functional analyses that use large heterogeneous collections of transcriptomics data. However, such large collections can hamper GBA functional analysis for genes whose expression is condition specific. In these cases a smaller set of condition-related experiments should be used instead, but identifying such functionally relevant experiments in large collections from literature knowledge alone is impractical. We begin this paper by analyzing, from both a mathematical and a biological point of view, why only condition-specific experiments should be used in GBA functional analysis. We show that this phenomenon is independent of the functional categorization scheme and of the organisms being analyzed. We then present a semi-supervised algorithm that selects functionally relevant experiments from large collections of transcriptomics experiments. The algorithm can select experiments relevant to a given GO term, MIPS FunCat term or KEGG pathway. We extensively test the algorithm on large dataset collections for yeast and Arabidopsis. We demonstrate that: (i) using the selected experiments yields a statistically significant improvement in correlation between genes in the functional category of interest; (ii) the selected experiments improve GBA-based gene function prediction; (iii) the effectiveness of the selected experiments increases with annotation specificity; and (iv) the algorithm can be successfully applied to GBA-based pathway reconstruction. Importantly, the set of experiments selected by the algorithm reflects existing literature knowledge about the experiments. A MATLAB implementation of the algorithm and all the data used in this paper can be downloaded from the paper website: http://www.paccanarolab.org/papers/CorrGene/.

Highlights

  • In the past decade, efforts for elucidating gene function have gained new impetus with the emergence of large scale transcriptomics and protein-protein interaction experiments

  • Our results show that using experiments selected by the algorithm leads to substantially improved correlation between genes in the same functional category compared to using large heterogeneous collections of experiments

  • Since the chosen subset should consist of the experiments that most perturb the genes in the functional category of interest, we shall refer to these as the relevant experiments

Introduction

Efforts to elucidate gene function have gained new impetus with the emergence of large-scale transcriptomics and protein-protein interaction experiments. These datasets are mined to identify groups of genes sharing similar features, on the premise that such genes may share similar functions; this principle has often been called Guilt-By-Association (GBA) [1,2,3,4]. GBA-based analyses often begin by calculating the similarity between gene expression profiles using a measure such as Pearson’s correlation. Often, this calculation has been performed over large heterogeneous collections of experiments. The significance of the correlation between vectors is likely to increase with the size of the vectors.
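To make the last two points concrete, the following is a minimal sketch, not the authors' implementation (which is in MATLAB). It computes Pearson's correlation between two expression profiles and the standard t statistic used to test it against zero correlation. The function names `pearson` and `t_statistic` and the toy profiles `gene_a` and `gene_b` are illustrative assumptions. For a fixed correlation value, the statistic grows with the number of experiments n, which is one way to read the statement that significance tends to increase with vector size.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("profiles must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical expression values for two genes across five experiments.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson(gene_a, gene_b)
print(f"r = {r:.3f}")
# The same r is far more significant when it is measured over more experiments.
for n in (5, 50, 500):
    print(f"n = {n:3d}  t = {t_statistic(r, n):.2f}")
```

The sketch holds r fixed while n varies. In a real collection, adding experiments also changes r itself, which is why a larger collection does not automatically help for condition-specific genes.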

