Perm-seq: Mapping Protein-DNA Interactions in Segmental Duplication and Highly Repetitive Regions of Genomes with Prior-Enhanced Read Mapping.

Xin Zeng, Ye Zheng, Constanza Rojo, Bo Li, Colin N Dewey, Sündüz Keleş, Rene Welch

doi:10.1371/journal.pcbi.1004491

Abstract

Segmental duplications and other highly repetitive regions of genomes contribute significantly to cells’ regulatory programs. Advances in next-generation sequencing have enabled genome-wide profiling of protein-DNA interactions by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq). However, interactions in highly repetitive regions of genomes have proven difficult to map, since short reads of 50–100 base pairs (bps) from these regions map to multiple locations in reference genomes. Standard analytical methods discard such multi-mapping reads, and the few that can accommodate them are prone to high false positive and false negative rates. We developed Perm-seq, a prior-enhanced read allocation method for ChIP-seq experiments that can allocate multi-mapping reads in highly repetitive regions of genomes with high accuracy. We comprehensively evaluated Perm-seq and found that our prior-enhanced approach significantly improves multi-read allocation accuracy over approaches that do not utilize additional data types. The statistical formalism underlying our approach allows multi-read allocation to be supervised by a variety of data sources, including histone ChIP-seq. We applied Perm-seq to 64 ENCODE ChIP-seq datasets from GM12878 and K562 cells and identified many novel protein-DNA interactions in segmental duplication regions. Our analysis reveals that although protein-DNA interaction sites are evolutionarily less conserved in repetitive regions, they share the overall sequence characteristics of protein-DNA interactions in non-repetitive regions.

Highlights

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is widely used for studying in vivo protein-DNA interactions genome-wide.
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become a versatile high-throughput assay for profiling transcription factor (TF) binding and histone modifications.
Utilizing a large number of ENCODE ChIP-seq datasets from GM12878 and K562 cells, we show that DNase-seq has significant power for discriminating between the mapping locations of multi-reads with similar local ChIP-seq read counts.

Summary

Introduction

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become a versatile high-throughput assay for profiling transcription factor (TF) binding and histone modifications. Although there have been some prior efforts to develop ChIP-seq-specific mappers that allocate multi-mapping reads (multi-reads) to one of their mapping positions based on local counts of uniquely mapping reads (uni-reads) [1,2,3,4,5], the standard practice for ChIP-seq experiments is to either use only uniquely mapping reads or retain a conservative set of multi-mapping reads (e.g., with at most 2–3 mapping positions) and choose one of the mapping positions at random [6]. This bottleneck has serious downstream effects when characterizing regulatory elements common or specific to distinct cell types; for example, cell-type-specific characteristics that reside in repetitive regions are grossly under-represented. Utilization of multi-mapping reads is especially important for characterizing regulatory activity in segmental duplications or LINE elements that harbor near-identical DNA sequences.
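To make the idea of count-based allocation concrete, the following is a minimal Python sketch of how a multi-read might be split across its candidate positions in proportion to local uni-read counts, optionally reweighted by a per-position prior such as one derived from DNase-seq signal. It is an illustration only and not the Perm-seq model itself: the function name `allocate_multireads`, the input dictionaries, the pseudocount, and the multiplicative use of the prior are all assumptions introduced here.

```python
def allocate_multireads(uni_counts, multi_reads, prior=None, pseudocount=1.0):
    """Fractionally assign each multi-read to its candidate positions.

    uni_counts  : dict mapping position -> local uni-read count (hypothetical input)
    multi_reads : dict mapping read id -> list of candidate mapping positions
    prior       : optional dict mapping position -> nonnegative weight, e.g.
                  derived from DNase-seq signal (assumption); positions that
                  are missing from the dict get weight 1.0
    pseudocount : added to every count so positions without uni-reads
                  still receive a nonzero share
    """
    allocation = {}
    for read_id, positions in multi_reads.items():
        weights = []
        for pos in positions:
            weight = uni_counts.get(pos, 0) + pseudocount
            if prior is not None:
                weight *= prior.get(pos, 1.0)
            weights.append(weight)
        total = sum(weights)
        # Normalize so the shares of one read sum to 1.
        allocation[read_id] = {
            pos: weight / total for pos, weight in zip(positions, weights)
        }
    return allocation


# Two candidate positions with identical local uni-read counts.
uni_counts = {"chr1:1000": 4, "chr7:5000": 4}
multi_reads = {"read_1": ["chr1:1000", "chr7:5000"]}

no_prior = allocate_multireads(uni_counts, multi_reads)
with_prior = allocate_multireads(
    uni_counts, multi_reads, prior={"chr1:1000": 3.0, "chr7:5000": 1.0}
)
print(no_prior["read_1"])
print(with_prior["read_1"])
```

In this toy input the two positions have the same local count, so counts alone cannot separate them and the read is split evenly; the hypothetical prior shifts the allocation toward the position with the stronger auxiliary signal.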

Methods

Results

Conclusion