Efficient haplotype block recognition of very long and dense genetic sequences

Daniel Taliun, Cristian Pattaro, Johann Gamper

doi:10.1186/1471-2105-15-10

Open Access

https://doi.org/10.1186/1471-2105-15-10

Abstract

Background
The new sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed interest in identifying haplotypes that carry the rare risk alleles. However, large-scale explorations of the linkage disequilibrium (LD) pattern to identify haplotype blocks are not easy to perform, because traditional algorithms have at least Θ(n²) time and memory complexity.

Results
We derived three incremental optimizations of the widely used haplotype block recognition algorithm proposed by Gabriel et al. in 2002. Our most efficient solution, called MIG++, has only Θ(n) memory complexity and, on a genome-wide scale, omits more than 80% of the calculations, which makes it an order of magnitude faster than the original algorithm. Unlike existing software, MIG++ analyzes the LD between SNPs at any distance, avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset was obtained in 457 hours. Replacing the standard likelihood-based D′ variance estimator with an approximate estimator improved the runtime further. While producing a coarser partition, the approximate method made it possible to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based haplotype blocks can span more than one million base pairs in both the HapMap II and 1000 Genomes datasets. An application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how MIG++ can support genome-wide haplotype association studies.

Conclusions
MIG++ makes it possible to perform LD-based haplotype block recognition on genetic sequences of any length and density. In the next-generation sequencing era, this can help identify haplotypes that carry rare variants of interest. The low computational requirements open the possibility of including the haplotype block structure in genome-wide association scans, downstream analyses, and visual interfaces for online genome browsers.

Highlights

The new sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available.
We describe how we improved the efficiency and scalability of the Haploview algorithm (1) by adopting an incremental computation of the haplotype blocks based on iterative chromosome scans and (2) by estimating D′ confidence intervals (CIs) using the approximate variance estimator proposed by Zapata et al. [26].
We present three gradual improvements of the Haploview algorithm: a memory-efficient implementation based on the Gabriel et al. [23] definition (MIG); MIG with additional search space pruning (MIG+); and MIG+ with iterative chromosomal processing (MIG++).
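
To make the block definition behind these highlights more concrete, the following is a minimal sketch of a naive check in the style of Gabriel et al. It is not the MIG, MIG+, or MIG++ algorithm: it visits every marker pair in a candidate region, which is exactly the quadratic work the optimizations aim to avoid. The threshold values are the commonly cited defaults for this definition and are an assumption here, since this text does not state them; the function names and the `ci_bounds` dictionary layout are hypothetical.

```python
from itertools import combinations

# Assumed thresholds (commonly cited defaults, not stated in this text).
STRONG_LOWER = 0.70         # minimum lower D' CI bound for "strong LD"
STRONG_UPPER = 0.98         # minimum upper D' CI bound for "strong LD"
RECOMB_UPPER = 0.90         # upper D' CI bound below this suggests recombination
MIN_STRONG_FRACTION = 0.95  # required share of strong-LD pairs among informative pairs


def classify_pair(ci_lower, ci_upper):
    """Label one marker pair from the bounds of its D' confidence interval."""
    if ci_lower >= STRONG_LOWER and ci_upper >= STRONG_UPPER:
        return "strong_ld"
    if ci_upper < RECOMB_UPPER:
        return "recombination"
    return "uninformative"


def is_block(ci_bounds, first, last):
    """Naive test of whether markers first..last (inclusive) pass the 95% criterion.

    ci_bounds maps a pair (i, j) with i < j to (ci_lower, ci_upper).
    """
    strong = recomb = 0
    for i, j in combinations(range(first, last + 1), 2):
        label = classify_pair(*ci_bounds[(i, j)])
        if label == "strong_ld":
            strong += 1
        elif label == "recombination":
            recomb += 1
    informative = strong + recomb
    return informative > 0 and strong / informative >= MIN_STRONG_FRACTION


# Hypothetical CI bounds for three markers.
example_ci = {(0, 1): (0.80, 1.00), (0, 2): (0.75, 0.99), (1, 2): (0.20, 0.60)}
print(is_block(example_ci, 0, 1))  # only one pair, and it is in strong LD
print(is_block(example_ci, 0, 2))  # one of three informative pairs shows recombination
```

A region of n markers contains n(n-1)/2 pairs, so storing and testing all of them is what gives the traditional approach its quadratic cost.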

Summary

Introduction

The new sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. When two markers have very different allele frequencies, the interpretation of r² becomes difficult. This is especially relevant for data generated by the new sequencing technologies, which allow genotyping of markers over a very wide spectrum of allele frequencies. In such situations, r² may fail to identify the correct relationship between nearby variants. In GWAS, this may lead to an incorrect definition of the identified loci.
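
The difficulty with r² can be illustrated with a small worked example. The sketch below uses the standard textbook definitions of D, D′, and r² for two biallelic markers; the function name `ld_stats` and the example frequencies are hypothetical and chosen only for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A at marker 1 and allele B at marker 2
    p_a:  frequency of allele A at marker 1
    p_b:  frequency of allele B at marker 2
    """
    d = p_ab - p_a * p_b
    # D' rescales |D| by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical case: a rare allele (1%) that always occurs on haplotypes
# carrying a common allele (50%) at the neighbouring marker.
d, d_prime, r2 = ld_stats(p_ab=0.01, p_a=0.01, p_b=0.50)
print(f"D = {d:.4f}, D' = {d_prime:.2f}, r2 = {r2:.4f}")
```

In this case the rare allele never appears without the common one, so D′ is 1, yet r² is only about 0.01 because the two allele frequencies are so different.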

Journal: BMC Bioinformatics
Publication Date: Jan 14, 2014
Citations: 138
License type: CC BY 2.0

Similar Papers

Comparative study for haplotype block partitioning methods - Evidence from chromosome 6 of the North American Rheumatoid Arthritis Consortium (NARAC) dataset.
Mohamed N Saad ... Olfat G Shaker
PLoS ONE | VOL. 13 | 31 Dec 2019

Haplotype Block Partitioning for NARAC Dataset Using Interval Graph Modeling of Clusters Algorithm
Fatma S Ibrahim ... Ashraf M Said
01 Dec 2018

Genome-wide association studies using single-nucleotide polymorphisms versus haplotypes: an empirical comparison with data from the North American Rheumatoid Arthritis Consortium
Heejung Shim ... Bret A Payseur
BMC Proceedings | VOL. 3 | 01 Dec 2009

Studying the effects of haplotype partitioning methods on the RA-associated genomic results from the North American Rheumatoid Arthritis Consortium (NARAC) dataset
Mohamed N Saad ... Olfat G Shaker
Journal of Advanced Research | VOL. 18 | 18 Jan 2019
