Abstract

Researchers generating new genome-wide data in an exploratory sequencing study can gain biological insights by comparing their data with well-annotated data sets possessing similar genomic patterns. Data compression techniques are needed for efficient comparisons of a new genomic experiment with large repositories of publicly available profiles. Furthermore, data representations that allow comparisons of genomic signals from different platforms and across species enhance our ability to leverage these large repositories. Here, we present a signal processing approach that characterizes protein–chromatin interaction patterns at length scales of several kilobases. This allows us to efficiently compare numerous chromatin-immunoprecipitation sequencing (ChIP-seq) data sets consisting of many types of DNA-binding proteins collected from a variety of cells, conditions and organisms. Importantly, these interaction patterns broadly reflect the biological properties of the binding events. To generate these profiles, termed Arpeggio profiles, we applied harmonic deconvolution techniques to the autocorrelation profiles of the ChIP-seq signals. We used 806 publicly available ChIP-seq experiments and showed that Arpeggio profiles with similar spectral densities shared biological properties. Arpeggio profiles of ChIP-seq data sets revealed characteristics that are not easily detected by standard peak finders. They also allowed us to relate sequencing data sets from different genomes, experimental platforms and protocols. Arpeggio is freely available at http://sourceforge.net/p/arpeggio/wiki/Home/.
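
The idea of characterizing a signal through the spectral content of its autocorrelation can be illustrated with a minimal sketch. This is not the Arpeggio implementation: the function names, the toy coverage track, the bin size and the plain discrete Fourier transform used here are all assumptions chosen for illustration, and the published method applies harmonic deconvolution rather than this simple transform.

```python
import cmath
import math


def autocorrelation(signal, max_lag):
    """Normalized autocorrelation of a 1-D coverage signal for lags 0..max_lag."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("signal is constant; autocorrelation is undefined")
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]


def spectral_density(acf):
    """Magnitude of the discrete Fourier transform of an autocorrelation profile."""
    n = len(acf)
    return [
        abs(sum(acf[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


# Hypothetical binned coverage track: an enriched bin every 8 bins.
coverage = [5.0 if i % 8 == 0 else 1.0 for i in range(256)]

acf = autocorrelation(coverage, max_lag=63)
density = spectral_density(acf)

# Strongest non-zero frequency bin; len(acf) / dominant_bin estimates the
# spacing (in bins) of the repeating pattern.
dominant_bin = max(range(1, len(density)), key=lambda k: density[k])
estimated_period = len(acf) / dominant_bin
```

Because the spectral density summarizes spacing patterns rather than exact genomic coordinates, two tracks can be compared by their densities even when they come from different genomes or platforms, which is the property the abstract relies on.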

Highlights

  • We investigated whether the space comprising all spectral densities in our database could be further reduced in dimensionality using multidimensional scaling (MDS) methods, such as principal component analysis (PCA)

Introduction

The advent of automation and the use of high-throughput sequencing techniques have brought a remarkable increase in the rate at which biological data sets are accumulated. Public repositories such as the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra) and The Cancer Genome Atlas (nih.gov/) already store thousands of genome-wide data sets from a variety of cells and biological conditions. It is plausible that similarities between genomic profiles from different experiments (e.g. binding profiles of transcription factors or transcriptomes of unique samples) are due to similar biological mechanisms, and in principle existing repositories can be used to explore uncharted relationships between a variety of genomic signals in various biological systems and conditions. Data compression techniques that allow comparison of genomic signals from different platforms and across
