Abstract

The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.

Highlights

  • The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome

  • Long-range haplotype information is needed to systematically study epistatic interactions between variants in enhancers and variants in their target genes or their promoters. This is critical because many variants that have been linked to traits in genome-wide association studies reside in enhancers[6], and enhancer-specific variants can show epistatic effects among one another[7], as well as with their target genes that are beyond the reach of linkage disequilibrium[8]

  • Sequencing technologies sample the human genome in the form of relatively short molecules and every read that spans at least two heterozygous variants can essentially be considered as a “mini haplotype” that can be assembled into longer haplotype segments by partially overlapping reads spanning the same variable locus[4]

Summary

Introduction

The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. Sequencing technologies sample the human genome in the form of relatively short molecules (reads), and every read that spans at least two heterozygous variants can essentially be considered a “mini haplotype” that can be assembled into longer haplotype segments by partially overlapping reads spanning the same variable locus[4]. To this end, haplotype-informative reads need to be partitioned into two disjoint sets that represent the two haplotypes. It has been shown that to generate a reliable long-range haplotype scaffold, relatively high sequence coverage (ideally ~90-fold) is needed to reduce bias caused by crosslinks between non-homologous chromosomes[32]. Because these haplotypes need to be inferred statistically, the probability that two heterozygous variants are correctly phased relative to each other deteriorates with increasing chromosomal distance.
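The read-partitioning idea described above can be illustrated with a small sketch. This is a toy greedy heuristic written for illustration only, not the method used in the paper; it assumes error-free reads, biallelic heterozygous sites, and a hypothetical representation of each read as a mapping from variant position to observed allele (0 = reference, 1 = alternative). The function name `partition_reads` and the example positions are invented for this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical read representation: variant position -> observed allele.
Read = Dict[int, int]


def partition_reads(reads: List[Read]) -> Tuple[List[int], Dict[int, int]]:
    """Greedily split reads into two sets and build haplotype 1.

    Returns (labels, hap1), where labels[i] is 0 or 1 for reads[i] and
    hap1 maps each variant position to the allele on haplotype 1.
    Haplotype 2 is the complement at every heterozygous site.
    """
    hap1: Dict[int, int] = {}
    labels = [0] * len(reads)
    # Process reads left to right so each one can overlap earlier reads.
    order = sorted(range(len(reads)), key=lambda i: min(reads[i]))
    for i in order:
        read = reads[i]
        shared = [p for p in read if p in hap1]
        agree = sum(read[p] == hap1[p] for p in shared)
        disagree = len(shared) - agree
        # With no shared site the choice is arbitrary and defaults to set 0.
        label = 1 if disagree > agree else 0
        labels[i] = label
        for p, allele in read.items():
            # A read in set 1 carries haplotype 2, so its alleles are flipped
            # before being used to extend haplotype 1 at new positions.
            hap1.setdefault(p, allele if label == 0 else 1 - allele)
    return labels, hap1


# Four toy reads, each spanning at least two heterozygous sites.
reads = [
    {100: 0, 250: 1},
    {250: 0, 400: 0},
    {250: 1, 400: 1, 900: 0},
    {400: 0, 900: 1},
]
labels, hap1 = partition_reads(reads)
print(labels)  # [0, 1, 0, 1]
print(hap1)    # {100: 0, 250: 1, 400: 1, 900: 0}
```

In this sketch, a read that shares no variant with the reads seen so far is assigned arbitrarily, so its phase relative to the earlier segment is unknown. That limitation is one way to see why read-based haplotypes are dense but local, and why a sparse, chromosome-wide scaffold is useful for connecting them.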

