Abstract

Background

Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study.

Results

Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments.

Conclusions

In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.
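As a rough illustration of the established quality measure mentioned above, the ratio of intra- to interchromosomal interactions can be computed by counting read pairs whose two ends map to the same chromosome versus different chromosomes. The sketch below is a minimal example, not the implementation used in the paper; the function name `intra_inter_ratio` and the input layout (one `(chrom1, chrom2)` tuple per read pair) are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def intra_inter_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Return the ratio of intra- to interchromosomal read pairs.

    Each element of `pairs` is (chromosome of end 1, chromosome of end 2).
    This input layout is a hypothetical choice for the example.
    """
    intra = 0  # both ends on the same chromosome
    inter = 0  # ends on different chromosomes
    for chrom1, chrom2 in pairs:
        if chrom1 == chrom2:
            intra += 1
        else:
            inter += 1
    if inter == 0:
        raise ValueError("no interchromosomal pairs; ratio is undefined")
    return intra / inter


# Toy input: three intrachromosomal pairs and one interchromosomal pair.
example_pairs = [
    ("chr1", "chr1"),
    ("chr1", "chr2"),
    ("chr2", "chr2"),
    ("chr3", "chr3"),
]
ratio = intra_inter_ratio(example_pairs)
```

On real data the pairs would come from the aligned and filtered reads; which ratio values indicate a low-quality experiment is not stated in this summary, so no threshold is assumed here.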

Highlights

  • Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease

  • Application of the Hi-C assay has allowed researchers to profile the 3D genome during important biological processes such as cellular differentiation [2, 3], X inactivation [4,5,6], and cell division [7] and to identify hallmarks of 3D organization of chromatin, such as compartments [1], topologically associating domains (TADs) [8,9,10], and DNA loops [11]

  • After aligning and filtering the paired-end sequencing reads, we obtain 10 to 61 million paired reads per experiment for 11 cell types and more than 400 million paired reads for the remaining two deeply sequenced cell types. These Hi-C interactions serve as a readout of the three-dimensional proximity of the corresponding genomic loci


Summary

Introduction

Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. Over the past decade, a rich body of literature has accumulated on assessing the quality and reproducibility of a wide range of next-generation sequencing-based genomics assays, such as ChIP-seq [14] and DNase-seq [15] [16,17,18]. Correlation coefficients [20,21,22] and statistical methods such as the irreproducible discovery rate (IDR) [17] have been used to measure the reproducibility of such assays. All of these methods are designed to operate on data that is laid out in one dimension along the genome. Unlike other functional genomics assays, Hi-C data must be analyzed at an effective resolution determined by the user [13, 23, 24]. For these reasons, existing methods for assessing genomic data quality and reproducibility are not directly applicable to Hi-C data.

