Abstract

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. On benchmarks, mHi-C outperformed both the use of uni-reads alone and heuristic approaches aimed at rescuing multi-reads. Specifically, mHi-C increased the sequencing depth by an average of 20%, resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of the multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, the cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C-rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.

Highlights

  • DNA is highly compressed in the nucleus and organized into a complex three-dimensional structure

  • We developed mHi-C (Figure 1—figure supplements 1 and 2), a hierarchical model that probabilistically allocates Hi-C multi-reads to their most likely genomic origins by utilizing specific characteristics of the paired-end reads of the Hi-C assay. mHi-C is implemented as a full analysis pipeline that starts from unaligned read files and produces a set of statistically significant interactions at a given resolution

  • For developing mHi-C and studying its operating characteristics, we utilized six published studies, resulting in eight datasets with multiple replicates, as summarized in Table 1 and with more details in Figure 1—source data 1: Table 1. These datasets represent a variety of study designs from different organisms, namely human and mouse cell lines as examples of large genomes and three different stages of the Plasmodium falciparum red blood cell cycle as an example of a small and AT-rich genome
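
The idea of probabilistically allocating a multi-read, as opposed to discarding it, can be illustrated with a toy sketch. This is not the mHi-C hierarchical model: it assumes, purely for illustration, that each candidate bin pair is weighted by its uni-read contact count plus a pseudocount, and that a read is kept only if its best candidate exceeds a probability threshold. The function names, the pseudocount, the 0.5 threshold, and the bin labels are all hypothetical.

```python
from collections import Counter


def allocate_multi_read(candidates, uni_counts, pseudocount=1.0):
    """Return {bin_pair: probability} for one multi-read.

    Assumption (not the mHi-C model): each candidate is weighted by the
    number of uni-read contacts already observed for that bin pair.
    """
    weights = {pair: uni_counts.get(pair, 0) + pseudocount for pair in candidates}
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}


def best_assignment(posterior, threshold=0.5):
    """Return the most probable bin pair, or None if it is not confident enough."""
    pair, prob = max(posterior.items(), key=lambda item: item[1])
    return pair if prob > threshold else None


# Hypothetical uni-read contact counts between genomic bins.
uni_counts = Counter({("chr1:10", "chr1:12"): 8, ("chr1:10", "chr5:40"): 0})

# One multi-read whose ends are compatible with two candidate bin pairs.
candidates = [("chr1:10", "chr1:12"), ("chr1:10", "chr5:40")]

posterior = allocate_multi_read(candidates, uni_counts)
assigned = best_assignment(posterior)
print(posterior)
print(assigned)
```

In this sketch the read is assigned to the candidate that is already well supported by uni-reads, while a read split evenly between candidates would remain unassigned.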

Introduction

DNA is highly compressed in the nucleus and organized into a complex three-dimensional structure. This compressed form brings distal functional elements into close spatial proximity to each other (Dekker et al., 2002; de Laat and Duboule, 2013) and has a far-reaching influence on gene regulation. The Hi-C technique (Lieberman-Aiden et al., 2009; Rao et al., 2014) emerged as a high-throughput technology for interrogating the three-dimensional configuration of the genome and identifying regions that are in close spatial proximity in a genome-wide fashion. A key and common step in these approaches is the exclusive use of uniquely mapping reads. Limiting the usable reads to only uniquely mapping reads underestimates signal originating from repetitive regions of the
