Abstract

Motivation: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown.

Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls.

Availability and implementation: The RECAP software is available through www.perkinslab.ca or on GitHub at https://github.com/theodorejperkins/RECAP.

Supplementary information: Supplementary data are available at Bioinformatics online.
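The recalibration idea described above can be sketched as an empirical-CDF mapping. This is only a minimal illustration and not the RECAP implementation: it assumes that p-values obtained by running a peak caller on resampled (null) data are already available in an array, and the function name `recalibrate` and the add-one smoothing are assumptions made for this example.

```python
import numpy as np

def recalibrate(p_observed, p_null):
    """Map raw p-values through the empirical CDF of null p-values.

    Hypothetical helper: p_null is assumed to hold p-values that a peak
    caller reported on resampled data with no real enrichment.
    """
    p_observed = np.asarray(p_observed, dtype=float)
    null_sorted = np.sort(np.asarray(p_null, dtype=float))
    # For each observed p-value, count the null p-values that are <= it.
    counts = np.searchsorted(null_sorted, p_observed, side="right")
    # Add-one smoothing (an assumption here) keeps results strictly positive.
    return (counts + 1) / (null_sorted.size + 1)

# Usage: raw p-values that look extreme are compared with what the
# peak caller produces when there is nothing to find.
raw = [1e-8, 1e-4, 0.2]
null = [1e-9, 1e-6, 1e-5, 1e-3, 0.05, 0.3]
print(recalibrate(raw, null))
```

Because the empirical CDF is non-decreasing, the mapping is monotone: the ranking of peaks is preserved and only the significance attached to each rank changes.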

Highlights

  • Chromatin Immunoprecipitation followed by high-throughput sequencing, or ChIP-seq, has become a central approach to mapping transcription factor-DNA binding sites and studying the epigenome [11, 21, 28]

  • RECAP is a wrapper algorithm that is compatible with almost any peak caller, and in particular MACS, SICER and diffReps, for which we provide wrapping scripts

  • False discovery rate (FDR) estimates based on recalibrated p-values are more reliable; in particular, we show that FDR q-values for peaks in ENCODE data track well the reproducibility of those peaks between biological replicates


Summary

Introduction

Chromatin Immunoprecipitation followed by high-throughput sequencing, or ChIP-seq, has become a central approach to mapping transcription factor-DNA binding sites and studying the epigenome [11, 21, 28]. DiffReps is designed to solve the differential enrichment problem (the comparison of two ChIP-seqs instead of a ChIP-seq and a control), which again comes up in some of our experiments. Although these approaches to peak calling differ in a number of ways, all three (and many others from the list cited above) follow a common two-stage pattern: first, candidate peaks are identified by analyzing the ChIP-seq data, and second, those candidate peaks are evaluated for significance by comparing ChIP-seq data with some kind of control data. We show that on a variety of different types of simulated null hypothesis ChIP-seq data, where there is no actual enrichment, RECAP-recalibrated p-values are approximately uniformly distributed between zero and one, as should be the case for well-calibrated statistical hypothesis testing. This gives a more intuitive way of choosing a significance cutoff for peak calling, and allows us to look at whether default cutoffs (such as the 10^-5 raw p-value cutoff in MACS) are overly conservative or still too loose. RECAP allows for much more rigorous and rational analysis of the significance of enrichment in ChIP-seq data, while allowing researchers to continue using the peak calling algorithms they already prefer and have come to depend on.
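A toy simulation can illustrate why the two-stage pattern biases p-values. The sketch below is not a peak caller and uses no ChIP-seq data: it assumes each genomic window has a standard normal enrichment score under the null, and the window count, the 1% selection rule and all variable names are illustrative choices.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_windows = 100_000

# Null data: one enrichment z-score per genomic window, no true signal anywhere.
z = rng.standard_normal(n_windows)

# One-sided upper-tail p-value for each window under the standard normal null.
p_all = np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])

# Stage 1: use the data to choose candidate "peaks" (top 1% of windows).
k = n_windows // 100
candidates = np.argsort(z)[-k:]

# Stage 2: report p-values for those candidates computed from the same data.
p_selected = p_all[candidates]

print("mean p-value, all windows:       ", p_all.mean())
print("largest p-value among candidates:", p_selected.max())
```

Across all windows the p-values behave as expected under the null, with a mean close to 0.5. Every selected candidate, however, has a p-value of roughly 0.01 or smaller even though no window carries real signal, so the candidate p-values are far from uniform.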

Results
RECAP: A wrapper algorithm that removes bias from peak caller p-values
Peak statistical significance and FDRs estimated by RECAP
Discussion
