Important biological information uncovered in previously unaligned reads from chromatin immunoprecipitation experiments (ChIP-Seq).

Wilberforce Zachary Ouma, Erich Grotewold, Andrea I Doseff, Pablo Pareja-Tobes, Maria Katherine Mejia-Guerra, Alper Yilmaz, Wei Li

doi:10.1038/srep08635

Abstract

Establishing the architecture of gene regulatory networks (GRNs) relies on chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) methods that provide genome-wide transcription factor binding sites (TFBSs). ChIP-Seq furnishes millions of short reads that, after alignment, describe the genome-wide binding sites of a particular transcription factor (TF). However, in all organisms investigated, an average of 40% of reads fail to align to the corresponding genome, and in some datasets as many as 80% of reads fail to align. We describe here the provenance of previously unaligned reads in ChIP-Seq experiments from animals and plants. We show that a substantial portion corresponds to sequences of bacterial and metazoan origin, irrespective of the ChIP-Seq chromatin source. Unexpectedly, 30%–40% of the unaligned reads were in fact alignable. To validate these observations, we investigated the characteristics of the previously unaligned reads corresponding to TAL1, a human TF involved in lineage specification of hemopoietic cells. We show that, while unmapped ChIP-Seq read datasets contain foreign DNA sequences, additional TFBSs can be identified from the previously unaligned ChIP-Seq reads. Our results indicate that the re-evaluation of previously unaligned reads from ChIP-Seq experiments will significantly contribute to TF target identification and determination of emerging properties of GRNs.

Highlights

Establishing the architecture of gene regulatory networks (GRNs) relies on chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) methods that provide genome-wide transcription factor binding sites (TFBSs)
We investigated the characteristics of the previously unaligned reads corresponding to TAL1, a human transcription factor (TF) involved in lineage specification of hemopoietic cells
Our results indicate that the re-evaluation of previously unaligned reads from ChIP-Seq experiments will significantly contribute to TF target identification and determination of emerging properties of GRNs

Summary

Introduction

Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) allows the in vivo characterization of genome-wide maps of protein-DNA interactions and epigenetic modifications. An in-depth ChIP-Seq analysis of one dataset of potentially legitimate, previously unaligned human reads identified novel TAL1 binding sites. These findings are important because using legitimate previously unaligned reads to identify additional TFBSs leads to the discovery of new target genes, which could enhance the construction of gene regulatory grids and networks. The sources of bias affecting ChIP-Seq results are still far from fully characterized. In this regard, expanding exploration to different portions of the data and/or including information from a wide range of techniques, organisms, and analysis pipelines is likely to uncover other limitations associated with ChIP-Seq experiments.
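To make the notion of "unaligned reads" concrete, the following is a minimal sketch of how such reads might be separated from aligned ones and their fraction computed. It assumes the aligner output is available as SAM-formatted text, in which FLAG bit 0x4 marks an unmapped read; the text above does not state which aligner or file format was used, so this is an illustration rather than the authors' pipeline. The function names `split_reads` and `unaligned_fraction` and the toy records are hypothetical.

```python
from typing import Iterable, List, Tuple

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def split_reads(sam_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (aligned_names, unaligned_names) from SAM-formatted lines."""
    aligned: List[str] = []
    unaligned: List[str] = []
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            unaligned.append(name)
        else:
            aligned.append(name)
    return aligned, unaligned


def unaligned_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of reads whose FLAG marks them as unmapped."""
    aligned, unaligned = split_reads(sam_lines)
    total = len(aligned) + len(unaligned)
    return len(unaligned) / total if total else 0.0


# Toy records (hypothetical, not data from the study).
toy_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "read3\t16\tchr2\t50\t60\t4M\t*\t0\t0\tGGCC\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tCCAA\tIIII",
    "read5\t0\tchr1\t300\t60\t4M\t*\t0\t0\tAATT\tIIII",
]

if __name__ == "__main__":
    print(unaligned_fraction(toy_sam))
```

The read names collected in the unaligned list are the ones that would then be passed to a second round of analysis, for example re-alignment under different settings or a search against non-host genomes.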

Methods

Results

Conclusion