Abstract

Evaluating the similarity of different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities’ measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low—especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities’ signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on two RNA-seq datasets and one miRNA-seq dataset.

Introduction

A fundamental task in data analysis is to assess the relatedness of different measured variables.

A p-value for rejecting a hypothesized value of T can be computed by summing only those probabilities that are less than or equal to the probability that R_ic = 1; these are the possible outcomes that are considered “more extreme” than the actual outcome. Performing this calculation for various values of T, we found that T ∈ [0.0520, 5.7550] would not be rejected at the standard p-value threshold of 0.05, indicating more than 2 orders of magnitude of uncertainty in the true level of entity i.

Imagine that we have measured gene expression across three different conditions, each with a sequencing depth of exactly 10^6, so that normalization is not an issue. We ought to be more confident in the yz correlation than in the wx correlation, because the higher sequencing depth in condition two means its measurements have greater precision.

R code implementing our approach is available as supplementary information (S1 Code) or on the web at http://www.perkinslab.ca/Software.html.
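
The exact p-value calculation described above (summing the probabilities of all outcomes that are no more likely than the observed count of 1) can be illustrated with a minimal sketch. This is not the authors' R code. It assumes, purely for illustration, that the read count follows a Poisson distribution whose mean is the hypothesized level T; the excerpt does not state the count model, and the function names `poisson_pmf`, `exact_p_value`, and `is_rejected` are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing count k under a Poisson model (assumed here)."""
    # Work in log space so large k does not overflow the factorial.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def exact_p_value(observed, mean):
    """Sum the probabilities of all outcomes no more likely than `observed`.

    Computed as 1 minus the total probability of the strictly more likely
    outcomes, which form a finite set around the mode of the distribution.
    """
    p_obs = poisson_pmf(observed, mean)
    # Small relative tolerance so that exact ties count as "as extreme".
    threshold = p_obs * (1.0 + 1e-7)
    more_likely = 0.0
    k = 0
    while True:
        p_k = poisson_pmf(k, mean)
        if p_k > threshold:
            more_likely += p_k
        elif k > max(observed, mean):
            # Past the mode the pmf only decreases, so nothing further
            # can exceed the threshold.
            break
        k += 1
    return max(0.0, 1.0 - more_likely)


def is_rejected(observed, mean, alpha=0.05):
    """True if the hypothesized mean is rejected at significance level alpha."""
    return exact_p_value(observed, mean) < alpha


if __name__ == "__main__":
    # Scan a few hypothesized levels T for an observed count of 1.
    for t in (0.04, 0.052, 1.0, 5.0, 7.0):
        print(t, round(exact_p_value(1, t), 4), is_rejected(1, t))
```

Scanning hypothesized values in this way shows how a single observed read is compatible with a wide range of underlying levels, which is the uncertainty that the passage above highlights. Under a different count model (for example, a binomial over the total number of reads) the boundaries of the non-rejected range could shift slightly.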
