Abstract

Background

Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task.

Results

We present an accurate modeling of dependences between genetic markers, based on a forest of hierarchical latent class models, which is a particular class of probabilistic graphical models. This model offers a framework suited to the fuzzy nature of linkage disequilibrium blocks. In addition, the data dimensionality can be reduced through the latent variables of the model, which synthesize the information borne by genetic markers. To tackle the learning of both the forest structure and the probability distributions, a generic algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on benchmarks describing 10^5 variables for 2000 individuals.

Conclusions

The forest of hierarchical latent class models offers several advantages for genome-wide association studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction and biological meaning borne by latent variables.

Highlights

  • Implementation: the CFHLC algorithm has been developed in C++, relying on the ProBT library dedicated to Bayesian networks (BNs), http://bayesianprogramming.org

  • In the subsection "Motivation of the FHLC model for genome-wide association studies (GWASs)", we argued that the multiple layers of a forest of hierarchical latent class models (FHLCM) can describe various degrees of linkage disequilibrium (LD) strength

Summary

Introduction

Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. The dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task. Genetic markers such as single nucleotide polymorphisms (SNPs) are the key to dissecting the genetic susceptibility of common complex diseases, such as asthma, diabetes, atherosclerosis and some cancers [1]. Decreasing genotyping costs enable the generation of hundreds of thousands of SNPs, spanning the whole human genome, across cohorts of cases and controls. This scaling up to genome-wide association studies (GWASs) makes the analysis of high-dimensional data a hot topic [2]. Since SNP patterns, rather than single SNPs, are likely to be determinant for complex diseases, three severe issues have to be overcome: a high rate of false positives, a perceptible decrease in statistical power and intractability.
