Abstract

Background: Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated.

Results: To make the forward algorithm for the haploid Li and Stephens model computationally tractable for these datasets, we have created a numerically exact version of the algorithm with observed average case sublinear runtime with respect to reference panel size k when tested against the 1000 Genomes dataset.

Conclusions: We show a forward algorithm which avoids any tradeoff between runtime and model complexity. Our algorithm makes use of two general strategies which might be applicable to improving the time complexity of other future sequence analysis algorithms: sparse dynamic programming matrices and lazy evaluation.
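To illustrate why the conventional forward algorithm is linear in reference panel size, here is a minimal Python sketch of a simple haploid Li and Stephens forward pass. It is not taken from the authors' library. The function name `forward_naive`, the constant per-site parameters `rho` (recombination probability) and `mu` (mismatch probability), and the biallelic 0/1 encoding are all assumptions made for illustration. Each site touches all k haplotypes, so the total work is O(nk) for n sites.

```python
def forward_naive(panel, query, rho, mu):
    """Return P(query | panel) under a simplified haploid Li and Stephens model.

    panel: list of k reference haplotypes, each a sequence of n alleles (0 or 1)
    query: sequence of n alleles (0 or 1)
    rho:   per-site recombination probability (assumed constant across sites)
    mu:    per-site mismatch probability (assumed constant across sites)
    """
    k = len(panel)
    n = len(query)

    def emission(j, i):
        # 1 - mu if haplotype j matches the query at site i, mu otherwise.
        return (1.0 - mu) if panel[j][i] == query[i] else mu

    # Initial column: uniform prior over haplotypes times the first emission.
    f = [emission(j, 0) / k for j in range(k)]

    for i in range(1, n):
        total = sum(f)  # sum of the previous column, shared by every state
        # Every one of the k states is updated explicitly: O(k) work per site.
        f = [
            emission(j, i) * ((1.0 - rho) * f[j] + (rho / k) * total)
            for j in range(k)
        ]

    return sum(f)
```

Because the previous column's sum is shared by all states, each column costs O(k) rather than O(k^2), but the per-site cost still grows with every haplotype added to the panel.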

Highlights

  • Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small

  • Our contributions: We have developed an arithmetically exact forward algorithm whose expected time complexity is a function of the expected allele distribution of the reference panel

  • We have developed a technique for succinctly representing large panels of haplotypes whose size scales as a sublinear function of the expected allele distribution
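The combination of sparsity and lazy evaluation described in these highlights can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the authors' implementation: it assumes biallelic sites, constant parameters with `0 < mu < 1` and `rho < 1`, and a single global affine map rather than the per-class update maps the paper refers to. The names `forward_lazy`, `minor_carriers`, and `query_is_minor` are hypothetical. The idea is that at each site all haplotypes not carrying the minor allele receive the same update `f -> e * ((1 - rho) * f + (rho / k) * S)`, which can be folded into one shared map, so only the minor-allele carriers are touched explicitly.

```python
def forward_lazy(k, minor_carriers, query_is_minor, rho, mu):
    """Return P(query | panel) touching only minor-allele carriers per site.

    k:               number of reference haplotypes
    minor_carriers:  per site, the indices of haplotypes carrying the minor allele
    query_is_minor:  per site, True if the query carries the minor allele
    rho, mu:         assumed constant recombination and mismatch probabilities
    """
    a, b = 1.0 - rho, rho / k
    # The true forward value of state j is scale * g.get(j, 0.0) + shift.
    # Before the first site this encodes the uniform prior 1/k.
    scale, shift, total = 1.0, 1.0 / k, 1.0
    g = {}  # sparse store; haplotypes never touched are implicitly 0.0

    for carriers, q_minor in zip(minor_carriers, query_is_minor):
        e_carrier = (1.0 - mu) if q_minor else mu
        e_bulk = mu if q_minor else (1.0 - mu)

        # Materialize the current values of the carriers only.
        old = {j: scale * g.get(j, 0.0) + shift for j in carriers}
        bulk_sum = total - sum(old.values())
        n_bulk = k - len(old)

        # Fold the bulk update into the shared affine map (constant work).
        new_scale = e_bulk * a * scale
        new_shift = e_bulk * (a * shift + b * total)
        new_total = e_bulk * (a * bulk_sum + b * total * n_bulk)

        # Update carriers explicitly, then re-express them under the new map.
        for j, fj in old.items():
            new_fj = e_carrier * (a * fj + b * total)
            g[j] = (new_fj - new_shift) / new_scale
            new_total += new_fj

        scale, shift, total = new_scale, new_shift, new_total

    return total
```

Under these assumptions the work per site is proportional to the number of minor-allele carriers rather than to k, which is why the running time depends on the allele frequency distribution of the panel. The sketch ignores floating-point underflow of `scale` over long sequences, which a practical implementation would need to address, for example by working in log space or by periodic rescaling.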

Summary

Results

Implementation: Our algorithm was implemented as a C++ library located at https://github.com/yoheirosen/sublinear-Li-Stephens. We built indices with multiallelic sites, which increases their time and memory profile relative to the results in the "Minor allele frequency distribution for the 1000 Genomes dataset" section but allows direct comparison to VCF records.

Discussion and Conclusion: To the best of our knowledge, ours is the first forward algorithm for any haplotype model to attain sublinear time complexity with respect to reference panel size. Favourable conditions for efficient time complexity of the lazy evaluation algorithm include:

Condition 1: The number of unique update maps added per step is constant with respect to the number of states k.

Example 1 (Diploid Li and Stephens): We have yet to implement this model but expect average runtime at least subquadratic in reference panel size k.

Author details: 1 UCSC Genomics Institute, 1156 High St, Santa Cruz, CA 95064, USA. 2 NYU School of Medicine, 550 First Ave, New York, NY 10016, USA

