Abstract

Considerable research effort has been devoted to probabilistic modeling of genetic population structures within the past decade. In particular, a wide spectrum of Bayesian models has been proposed for unlinked molecular marker data from diploid organisms. Here we derive a theoretical framework for learning the genetic population structure of a haploid organism from bi-allelic markers for which potential patterns of dependence are a priori unknown and are to be explicitly incorporated in the model. Our framework is based on the principle of minimizing the stochastic complexity of an unsupervised classification under a tree-augmented factorization of the predictive data distribution. We discuss a fast implementation of the learning framework using deterministic algorithms.

Highlights

  • The concept of a structured or subdivided population has received intensive attention both in applied and theoretical population genetics for several decades

  • Although the traditional formulation is still widely used in applied population genetics, a complementary approach to determining how subdivided a population is has grown popular within the past decade

  • The classification itself will be made notationally explicit in a later section, whereas here we seek a probabilistic representation of the possible dependencies among the d allele frequencies of the individual loci in any particular pool


Summary

Introduction

Entropy 2010, 12

The concept of a structured or subdivided population has received intensive attention in both applied and theoretical population genetics for several decades. Such populations can be considered to harbor multiple pools of individuals, each associated with distinct allele frequencies over a set of molecular marker loci. Although the traditional formulation is still widely used in applied population genetics, a complementary approach to determining how subdivided a population is has grown popular within the past decade. This approach is fundamentally based on statistical learning of a mixture model for genotype data from multiple marker loci, where the mixture components represent gene pools that have drifted apart over time. In the current work we derive a generalization of the unsupervised classification approach to inferring genetic population structures from biallelic multilocus data, where the loci are no longer assumed to be conditionally independent given a pool. The final sections thereafter discuss deterministic algorithms for learning the optimal population structure under the introduced framework and provide some concluding remarks.
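To make the idea of tree-structured dependence among biallelic loci concrete, the following is a minimal sketch of the classical Chow-Liu construction: each pair of loci is weighted by its empirical mutual information, and a maximum-weight spanning tree is kept. It is an illustration only. The function names (`mutual_information`, `chow_liu_tree`) and the toy data are hypothetical, and the sketch uses plain mutual information rather than the stochastic complexity criterion that the framework described here is based on.

```python
import math
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two 0/1 sequences."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            n_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b)
            if n_ab == 0:
                continue  # the term 0 * log(0) is taken as 0
            n_a = sum(1 for xi in x if xi == a)
            n_b = sum(1 for yi in y if yi == b)
            mi += (n_ab / n) * math.log(n_ab * n / (n_a * n_b))
    return mi


def chow_liu_tree(data):
    """Return the edges (i, j) of a maximum-weight spanning tree over loci.

    data: list of haplotypes, each a list of 0/1 alleles of equal length.
    Edge weights are pairwise empirical mutual informations (an assumption
    of this sketch, not the criterion used in the paper).
    """
    d = len(data[0])
    columns = [[row[j] for row in data] for j in range(d)]
    # Sort all locus pairs by decreasing mutual information.
    weighted = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True,
    )
    # Kruskal's algorithm with a simple union-find over loci.
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges


# Toy data: loci 0 and 1 are identical, loci 2 and 3 are identical,
# and the two pairs are independent of each other.
haplotypes = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
tree_edges = chow_liu_tree(haplotypes)
```

On the toy data the two perfectly coupled pairs of loci are joined first, and a single zero-weight edge then links the two pairs into one tree. In a classification setting, one such tree would be fitted per pool.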

Results and Discussion
Discussion
Prior predictive data distributions under Chow expansion
Asymptotic expansion of the stochastic complexity for a Chow expansion
