Abstract

Motivation: A sequencing-based genomic assay such as ChIP-seq outputs a real-valued signal for each position in the genome that measures the strength of activity at that position. Most genomic signals lack the property of variance stabilization. That is, a difference between 0 and 100 reads usually has a very different statistical importance from a difference between 1000 and 1100 reads. A statistical model such as a negative binomial distribution can account for this pattern, but learning these models is computationally challenging. Therefore, many applications, including imputation and segmentation and genome annotation (SAGA), instead use Gaussian models and use a transformation such as log or inverse hyperbolic sine (asinh) to stabilize variance.

Results: We show here that existing transformations do not fully stabilize variance in genomic datasets. To solve this issue, we propose VSS, a method that produces variance-stabilized signals for sequencing-based genomic signals. VSS learns the empirical relationship between the mean and variance of a given signal dataset and produces transformed signals that normalize for this dependence. We show that VSS successfully stabilizes variance and that doing so improves downstream applications such as SAGA. VSS will eliminate the need for downstream methods to implement complex mean–variance relationship models, and will enable genomic signals to be easily understood by eye.

Availability and implementation: https://github.com/faezeh-bayat/VSS

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Sequencing-based assays can measure many types of genomic biochemical activity, including transcription factor (TF) binding, histone modifications and chromatin accessibility

  • We found that the variance has a strong dependence on the mean; genomic positions with low signals experience little variance across replicates, whereas positions with high signals experience much larger variance (Fig. 1c)

  • We found that a uniform variance model implied by using untransformed signals had a poor likelihood, reflecting nonuniform variance (Fig. 2b, panel Fold enrichment (FE))
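The mean-dependent variance described in these highlights can be illustrated with a small simulation. The sketch below is not the VSS implementation: the noise model, the function names (`simulate_replicates`, `binned_mean_variance`) and all parameter values are assumptions chosen only to show how an empirical mean–variance curve might be estimated from two replicates.

```python
import random
import statistics


def simulate_replicates(n_positions=5000, seed=0):
    """Simulate two replicate signals whose noise grows with the true signal.

    Hypothetical toy model (not from the paper): Gaussian noise with
    variance mu + 0.05 * mu**2, clipped at zero.
    """
    rng = random.Random(seed)
    rep1, rep2 = [], []
    for _ in range(n_positions):
        mu = rng.expovariate(1 / 50.0)       # true signal, average 50
        sd = (mu + 0.05 * mu * mu) ** 0.5    # overdispersed noise level
        rep1.append(max(0.0, rng.gauss(mu, sd)))
        rep2.append(max(0.0, rng.gauss(mu, sd)))
    return rep1, rep2


def binned_mean_variance(rep1, rep2, n_bins=10):
    """Return (mean, variance) per bin, with positions binned by signal level."""
    # Sort positions by their across-replicate mean, then cut into equal bins.
    pairs = sorted(zip(rep1, rep2), key=lambda p: (p[0] + p[1]) / 2)
    size = len(pairs) // n_bins
    curve = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        means = [(a + c) / 2 for a, c in chunk]
        # Unbiased sample variance of two values a, c equals (a - c)^2 / 2.
        variances = [(a - c) ** 2 / 2 for a, c in chunk]
        curve.append((statistics.fmean(means), statistics.fmean(variances)))
    return curve


if __name__ == "__main__":
    for mean, var in binned_mean_variance(*simulate_replicates()):
        print(f"mean={mean:8.2f}  variance={var:10.2f}")
```

Under this toy model, the printed variance rises with the bin mean, which is the qualitative pattern the highlights describe for real replicates.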

Introduction

Sequencing-based assays can measure many types of genomic biochemical activity, including transcription factor (TF) binding, histone modifications and chromatin accessibility. These assays work by extracting DNA fragments from a sample that exhibit the desired type of activity, sequencing the fragments to produce sequencing reads and mapping each read to the genome. Read counts from genomic assays have a nonuniform mean–variance relationship: the variance of the data is a function of the read count, with higher variance at higher read counts and lower variance at lower read counts. This poses a challenge to their analysis. For example, the difference in read count between biosamples is a poor measure of the difference in activity. A locus with 100 reads in one replicate and 0 in the other is usually considered more significant than a locus with 1100 reads in one replicate and 1000 reads in the other.
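The 0 versus 100 and 1000 versus 1100 example can be worked through numerically. The sketch below compares the raw read-count gap with the gap after the asinh and log transformations named in the abstract. The helper name `transformed_gap` is hypothetical, and the use of `log(x + 1)` (a pseudocount of 1 so that zero counts are defined) is an assumption rather than something the source specifies.

```python
import math


def transformed_gap(a, b, transform):
    """Absolute difference between two read counts after a transformation."""
    return abs(transform(b) - transform(a))


def identity(x):
    """Leave the read count untransformed."""
    return float(x)


# Read-count pairs from the text: both differ by exactly 100 reads.
pairs = [(0, 100), (1000, 1100)]

# math.log1p(x) computes log(1 + x); the pseudocount of 1 is an assumption.
transforms = {"raw": identity, "asinh": math.asinh, "log(x+1)": math.log1p}

if __name__ == "__main__":
    for name, fn in transforms.items():
        gaps = [transformed_gap(a, b, fn) for a, b in pairs]
        print(f"{name:>9}: gap(0,100)={gaps[0]:.4f}  gap(1000,1100)={gaps[1]:.4f}")
```

On the raw scale the two gaps are identical, whereas after either transformation the 0 versus 100 gap is much larger than the 1000 versus 1100 gap, matching the intuition that the former is more significant. This shows only the direction of the effect; it does not show that either transformation fully stabilizes variance.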
