Haplotype inference based on Hidden Markov Models in the QTL-MAS 2010 multi-generational dataset

Carl Nettelblad

doi:10.1186/1753-6561-5-s3-s10

Abstract

Background: We have previously demonstrated an approach for efficient computation of genotype probabilities, and more generally probabilities of allele inheritance, in inbred as well as outbred populations. That work also included an extension for haplotype inference, or phasing, using Hidden Markov Models. Computational phasing of multi-thousand marker datasets has not yet become common. In this communication, we further investigate the previously presented method for such problems, in a multi-generational dataset simulated for QTL detection.

Results: When analyzing the dataset simulated for the 14th QTLMAS workshop, the phasing produced showed zero deviations from the original simulated phase in the founder generation. In total, 99.93% of all markers were correctly phased, and 97.68% of the individuals were correct at all markers over all 5 simulated chromosomes. Results were produced over a weekend on a small computational cluster. The specific algorithmic adaptations needed for the Markov model training approach to reach convergence are described.

Conclusions: Our method provides efficient, near-perfect haplotype inference, allowing the determination of completely phased genomes in dense pedigrees. These developments are of special value for applications where marker alleles do not correspond directly to QTL alleles, thus necessitating tracking of allele origin, and in complex multi-generational crosses. The cnF2freq codebase, which is under active development, is available under a BSD-style license.

Highlights

Computations used 64 cores on Intel Core 2 Quad 2.66 GHz CPUs, distributed over 8 nodes in a cluster.
The code is written in C++ and parallelised using OpenMP and the MPI support in the Boost library [12].

Summary

Introduction

We have previously demonstrated an approach for efficient computation of genotype probabilities, and more generally probabilities of allele inheritance, in inbred as well as outbred populations. Computational phasing of multi-thousand marker datasets has not yet become common. In this communication, we further investigate the previously presented method for such problems, in a multi-generational dataset simulated for QTL detection. Most of the research on reconstructing haplotypes from unphased data, such as application of the EM algorithm [1], Clark’s algorithm [2], and certain Bayesian methods [3], was designed for [...]. In this communication, we focus on reconstruction of haplotypes in experimental crosses of different designs. A highly efficient method, with excellent convergence properties thanks to a specially adapted optimisation algorithm, is presented. This method has previously been briefly discussed with application to an earlier QTL-MAS workshop dataset [5], where it was shown to surpass the phasing results produced with other methods [6].

Methods

Results

Conclusion