Abstract

This article proposes a novel approach to individual human haplotype phasing through the discovery of hidden relations among single variant sites. The proposed framework, called ARHap, learns strong association rules among variant loci on the genome and uses a combinatorial approach for fast and accurate haplotype phasing based on the discovered associations. ARHap is composed of two main modules, or processing phases. In the first phase, called association rule learning, ARHap identifies quantitative association rules from a collection of DNA reads of the organism under study, resulting in a set of strong rules that reveal the inter-dependency of alleles. In the second phase, called haplotype reconstruction, we develop algorithms that use the learned rules to construct highly reliable haplotypes at individual single nucleotide polymorphism (SNP) sites. Several features of ARHap lead to both fast and accurate haplotyping. It reconstructs haplotypes incrementally, so that in each round of the algorithm association rules are generated only for the SNP sites that are still unreconstructed. During each round, the association rule learning module constrains the length of the rules and limits them to those that contribute to the reconstruction of unreconstructed sites. The framework begins by generating short rules that are very strong. If some SNP sites remain unreconstructed, the rule length can be increased and/or the criteria for rule strength relaxed gradually during subsequent rounds. This adaptive approach, which uses feedback from the haplotype reconstruction module, avoids generating rules that do not contribute to haplotype reconstruction as well as weak rules that may introduce errors into the final haplotypes. Extensive experimental analyses on datasets representing diploid organisms demonstrate the superiority of ARHap in diploid haplotyping over state-of-the-art algorithms. In particular, we show that this approach to haplotype phasing is not only fast but also achieves significantly better accuracy than other read-based computational approaches.
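The adaptive loop described above can be illustrated with a small Python sketch. This is not ARHap's implementation: it assumes reads are given as dictionaries mapping a SNP site to a 0/1 allele, restricts rules to pairs of sites (rule length 2), and relaxes only a confidence threshold between rounds. The names `learn_pair_rules`, `phase`, `min_support`, and `conf_schedule` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def learn_pair_rules(reads, targets, min_support, min_conf):
    """Learn pairwise rules that involve at least one site in `targets`.

    Returns {(i, j): relation}, where relation is 0 if the two sites tend to
    carry the same allele on a read and 1 if they tend to differ.
    Assumption: this pairwise "same/different" rule is a simplification of
    the quantitative association rules mentioned in the abstract.
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [same count, different count]
    for read in reads:
        for i, j in combinations(sorted(read), 2):
            if i in targets or j in targets:
                counts[(i, j)][read[i] ^ read[j]] += 1
    rules = {}
    for pair, (same, diff) in counts.items():
        total = same + diff
        confidence = max(same, diff) / total
        if total >= min_support and confidence >= min_conf:
            rules[pair] = 0 if same >= diff else 1
    return rules


def phase(reads, sites, min_support=2, conf_schedule=(0.9, 0.8, 0.7, 0.6)):
    """Incrementally phase a diploid sample, relaxing confidence when stuck."""
    hap = {sites[0]: 0}          # fix the first site to break the h/complement symmetry
    unresolved = set(sites[1:])
    level = 0                    # index into the (hypothetical) relaxation schedule
    while unresolved and level < len(conf_schedule):
        # Only generate rules that touch a still-unreconstructed site.
        rules = learn_pair_rules(reads, unresolved, min_support, conf_schedule[level])
        before = len(unresolved)
        changed = True
        while changed:           # propagate along chains of rules
            changed = False
            for (i, j), rel in rules.items():
                if i in hap and j in unresolved:
                    hap[j] = hap[i] ^ rel
                    unresolved.discard(j)
                    changed = True
                elif j in hap and i in unresolved:
                    hap[i] = hap[j] ^ rel
                    unresolved.discard(i)
                    changed = True
        if len(unresolved) == before:
            level += 1           # feedback: no progress, so accept weaker rules
    h1 = [hap.get(s) for s in sites]                 # None marks an unresolved site
    h2 = [None if a is None else 1 - a for a in h1]  # diploid complement
    return h1, h2
```

Calling `phase(reads, sites)` returns the two complementary haplotypes as lists aligned with `sites`. In this sketch a site whose only supporting rule falls below the strictest threshold is left unresolved in early rounds and is filled in only after the threshold is lowered, which mirrors the idea of starting with strong rules and relaxing the criteria gradually.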
