A haplotype inference algorithm for trios based on deterministic sampling

Alexandros Iliadis, Xiaodong Wang, John Watkinson, Dimitris Anastassiou

doi:10.1186/1471-2156-11-78

Open Access

https://doi.org/10.1186/1471-2156-11-78

Journal: BMC Genetics
Publication Date: Aug 23, 2010
Citations: 27
License type: CC BY

Affiliation: Columbia University

Abstract

Background: In genome-wide association studies, thousands of individuals are genotyped in hundreds of thousands of single nucleotide polymorphisms (SNPs). Statistical power can be increased when haplotypes, rather than three-valued genotypes, are used in analysis, so the problem of haplotype phase inference (phasing) is particularly relevant. Several phasing algorithms have been developed for data from unrelated individuals, based on different models, some of which have been extended to father-mother-child "trio" data.

Results: We introduce a technique for phasing trio datasets using a tree-based deterministic sampling scheme. We have compared our method with the publicly available algorithms PHASE v2.1, BEAGLE v3.0.2 and 2SNP v1.7 on datasets with varying numbers of markers and trios. We have found that the computational complexity of PHASE makes it prohibitive for routine use; on the other hand, 2SNP, though the fastest method for small datasets, was significantly inaccurate. We have shown that our method outperforms BEAGLE in both speed and accuracy for small to intermediate numbers of trios, for all marker sizes examined. Our method is implemented in the "Tree-Based Deterministic Sampling" (TDS) package, available for download at http://www.ee.columbia.edu/~anastas/tds

Conclusions: Using a Tree-Based Deterministic Sampling technique, we present an intuitive and conceptually simple phasing algorithm for trio data. The trade-off between speed and accuracy achieved by our algorithm makes it a strong candidate for routine use on trio datasets.

Highlights

In genome-wide association studies, thousands of individuals are genotyped in hundreds of thousands of single nucleotide polymorphisms (SNPs).
We created 20 datasets, each consisting of 4000 haplotypes with 20 Mb of marker data, using the "best-fit" parameters obtained from fitting a coalescent model to the real data.
We have introduced a new algorithm for inferring haplotype phase in nuclear families using a Tree-Based Deterministic Sampling scheme.

Summary

Introduction

In genome-wide association studies, thousands of individuals are genotyped in hundreds of thousands of single nucleotide polymorphisms (SNPs). Statistical power can be increased when haplotypes, rather than three-valued genotypes, are used in analysis. Because many haplotype arrangements of the heterozygous SNPs are consistent with the observed three-valued genotypes, the problem of inferring haplotype phase ("phasing") becomes relevant. Such inference is based on modelling the mechanisms and biological processes that generate sequence variation. "Trio" data, consisting of genotypes from father-mother-child triplets, are obtained in genome-wide association studies, and some phasing algorithms have been adapted for this type of data.
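
The ambiguity described above can be illustrated with a minimal Python sketch. It is not the TDS algorithm and is not taken from the TDS package; the function names are hypothetical, and genotypes are assumed to be coded as 0, 1 or 2 copies of one allele at each SNP, with no missing data or genotyping error. The first function enumerates the haplotype pairs consistent with one individual's genotype; the second shows how parental genotypes can resolve the child's phase at a single SNP by Mendelian inheritance.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    genotype: sequence of 0, 1 or 2 (assumed coding: copies of allele 1).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
    base = [g // 2 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]
    first, rest = het_sites[0], het_sites[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(base), list(base)
        # Fix the first heterozygous site so swapped pairs are not counted twice.
        h1[first], h2[first] = 0, 1
        for site, b in zip(rest, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def child_transmission(father, mother, child):
    """Return (paternal_allele, maternal_allele) at one SNP if it is
    determined by the trio genotypes, or None if it is ambiguous.

    Raises ValueError if the genotypes are not Mendelian-consistent.
    """
    can_transmit = {0: (0,), 1: (0, 1), 2: (1,)}
    options = [
        (p, m)
        for p in can_transmit[father]
        for m in can_transmit[mother]
        if p + m == child
    ]
    if not options:
        raise ValueError("genotypes are not Mendelian-consistent")
    return options[0] if len(options) == 1 else None


if __name__ == "__main__":
    for pair in consistent_haplotype_pairs((1, 1, 2, 0, 1)):
        print(pair)
    print(child_transmission(2, 1, 1))
    print(child_transmission(1, 1, 1))
```

Under these assumptions, a genotype with k heterozygous SNPs (k at least 1) is consistent with 2^(k-1) unordered haplotype pairs, which is why phase must be inferred statistically for unrelated individuals. In a trio, the child's phase at a SNP is left undetermined by Mendelian rules alone only when father, mother and child are all heterozygous at that SNP.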

Methods

Results

Discussion

Conclusion