Whole-genome haplotyping approaches and genomic medicine

Gustavo Glusman, Jared C Roach, Hannah C Cox

doi:10.1186/s13073-014-0073-7

Abstract

Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.

Highlights

Technological progress has enabled the routine resequencing of human genomes
Haplotype assembly for single human (HASH) uses a Markov chain Monte Carlo (MCMC) algorithm and graph partitioning approach to assemble haplotypes given a list of heterozygous variants and a set of shotgun sequence reads mapped to a reference genome assembly [21]
If a rare variant is assigned to a haplotype by other methods, its presence on a haplotype determined by common single nucleotide polymorphisms (SNPs) can be probabilistically inferred [69]

Summary

Introduction

Technological progress has enabled the routine resequencing of human genomes. These genomes include rare variants at high frequency [1,2] that are the result of exponential human population growth over the past hundred generations [3].

Population-based haplotyping: the process of assigning the most likely order of common alleles along each haploid segment of DNA according to the frequency of observation in a large sample set. This method constructs haplotypes from unordered genotype data.

Extreme dilution of genomic DNA can generate long-range haplotypes without requiring the sorting of metaphase chromosomes or cloning. These methods recreate, with twists, the basic method used to sequence the human genome: local haplotypes (on the order of tens of kilobases) are first carefully sequenced and then strung together by aligning overlaps.

HASH (haplotype assembly for single human) uses a Markov chain Monte Carlo (MCMC) algorithm and graph partitioning approach to assemble haplotypes given a list of heterozygous variants and a set of shotgun sequence reads mapped to a reference genome assembly [21]. HapCut uses the overlapping structure of the fragment matrix and max-cut computations to find the optimum minimum error correction (MEC).
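To make the MEC criterion concrete, the following minimal sketch scores a candidate haplotype against a small set of read fragments and finds the best candidate by exhaustive search. It is an illustration only and is not the HASH or HapCut algorithm: neither MCMC sampling nor max-cut computation is implemented here, and brute force is practical only for a handful of sites. The function names, the fragment encoding (a dict from heterozygous-site index to observed allele, coded 0 or 1), and the toy data are all assumptions made for this example.

```python
from itertools import product


def mec_score(haplotype, fragments):
    """Return the minimum error correction (MEC) score of a candidate haplotype.

    haplotype: tuple of 0/1 alleles, one per heterozygous site.
    fragments: list of dicts mapping site index -> observed allele (0 or 1).
    Each fragment is assigned to the haplotype or to its complement,
    whichever requires fewer allele corrections; corrections are summed.
    """
    total = 0
    for frag in fragments:
        mismatches = sum(1 for site, allele in frag.items()
                         if haplotype[site] != allele)
        # Mismatches against the complementary haplotype are the remainder.
        total += min(mismatches, len(frag) - mismatches)
    return total


def best_haplotype_bruteforce(n_sites, fragments):
    """Exhaustively search all phasings and return (haplotype, MEC score)."""
    best = None
    for hap in product((0, 1), repeat=n_sites):
        if hap[0] != 0:
            # A haplotype and its complement describe the same diploid
            # phasing, so fix the first allele to visit each pair once.
            continue
        score = mec_score(hap, fragments)
        if best is None or score < best[1]:
            best = (hap, score)
    return best


# Hypothetical toy data: five fragments covering four heterozygous sites.
# The last fragment carries one allele that conflicts with the others.
FRAGMENTS = [
    {0: 0, 1: 1},
    {1: 1, 2: 1},
    {2: 0, 3: 1},
    {1: 0, 2: 0, 3: 1},
    {0: 0, 2: 1, 3: 1},
]

if __name__ == "__main__":
    hap, score = best_haplotype_bruteforce(4, FRAGMENTS)
    print("best haplotype:", hap, "MEC:", score)
```

The search space doubles with every additional heterozygous site, which is why practical tools rely on heuristics such as sampling or graph cuts rather than enumeration.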

Limitations


Conclusions and future directions


Findings
