Abstract

Background: The discovery of genetic associations is an important step in understanding human illness and deriving disease pathways. Identifying multiple interacting genetic mutations associated with disease remains a challenge in studying the etiology of complex diseases. Although genome-wide association studies (GWAS) have recently confirmed that new single nucleotide polymorphisms (SNPs) at genes implicated in immune response, cholesterol/lipid metabolism, and cell membrane processes are associated with late-onset Alzheimer's disease (LOAD), a percentage of AD heritability remains unexplained. We use data mining methods to search for other genetic variants that may influence LOAD risk.

Methods: Two different approaches were devised to select SNPs associated with LOAD in a publicly available GWAS data set consisting of three cohorts. In both approaches, single-locus analysis (logistic regression) was used to filter the data with a p-value threshold less conservative than the Bonferroni threshold; the resulting subset of SNPs was then used in multi-locus analysis (random forest (RF)). In the second approach, we took prior biological knowledge into account and performed sample stratification and linkage disequilibrium (LD) analysis, in addition to logistic regression analysis, to preselect the loci used as input to the RF classifier construction step.

Results: The first approach gave 199 SNPs, mostly associated with genes in calcium signaling, cell adhesion, endocytosis, immune response, and synaptic function. These SNPs, together with APOE and GAB2 SNPs, formed a predictive subset for LOAD status with an average error of 9.8% using 10-fold cross validation (CV) in RF modeling. Nineteen variants in LD with ST5, TRPC1, ATG10, ANO3, NDUFA12, and NISCH respectively, genes linked directly or indirectly with neurobiology, were identified with the second approach. These variants were part of a model that included APOE and GAB2 SNPs to predict LOAD risk, which produced a 10-fold CV average error of 17.5% in the classification modeling.

Conclusions: With the two proposed approaches, we identified a large subset of SNPs in genes mostly clustered around specific pathways/functions and a smaller set of SNPs, within or in proximity to five genes not previously reported, that may be relevant for the prediction/understanding of AD.
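
The two-stage design described in the Methods (a relaxed single-locus filter followed by a cross-validated classifier) can be illustrated with a minimal sketch. It assumes the single-locus p-values have already been computed elsewhere. The function names, the SNP identifiers, and the relaxed cutoff of 1e-3 are hypothetical and are not taken from the study; the classifier is passed in as a pair of callables, so the RF itself is not implemented here.

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Family-wise threshold: alpha divided by the number of SNPs tested."""
    return alpha / n_tests

def prefilter_snps(p_values, cutoff):
    """Keep the SNP ids whose single-locus p-value falls below `cutoff`."""
    return sorted(snp for snp, p in p_values.items() if p < cutoff)

def cv_average_error(fit, predict, X, y, k=10, seed=0):
    """Average misclassification rate over k folds.

    `fit(X_train, y_train)` returns a model and `predict(model, x)` returns
    a label; both are supplied by the caller (hypothetical interface).
    """
    if len(y) < k:
        raise ValueError("need at least k samples")
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / k

if __name__ == "__main__":
    # Made-up p-values for three made-up SNP ids.
    p_values = {"rs_a": 1e-9, "rs_b": 5e-5, "rs_c": 0.2}
    strict = bonferroni_threshold(0.05, 500_000)
    print(prefilter_snps(p_values, strict))  # Bonferroni-level filter
    print(prefilter_snps(p_values, 1e-3))    # relaxed filter (assumed cutoff)
```

The relaxed cutoff lets SNPs with modest marginal effects reach the multi-locus step, where their joint contribution can be assessed; the price is more false positives entering the classifier.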

Highlights

  • The discovery of genetic associations is an important factor in the understanding of human illness to derive disease pathways

  • Single nucleotide polymorphisms (SNPs) from APOE and GAB2 were excluded from the analysis because they are already known to be associated with Alzheimer’s disease (AD) in this data set [9]

  • The out-of-bag (OOB) error rate is not a good estimate of the test error in all instances here; as features are added to the forest, the OOB error becomes a better estimator of the test error for these two data sets
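
The comparison between OOB error and test error in the last highlight can be made concrete with a toy sketch. It is not the RF implementation used in the study: it builds a small forest of one-split trees (stumps) on bootstrap samples with a random feature subset per tree, scores each training sample only with the trees that did not see it (the OOB estimate), and compares that with the error on a held-out set. The synthetic genotype data, all function names, and all parameter values are hypothetical.

```python
import random

def make_toy_data(n, p, seed):
    """Synthetic genotypes coded 0/1/2; only feature 0 drives the label (10% noise)."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
    y = [int(r[0] > 0) if rng.random() > 0.1 else int(r[0] == 0) for r in X]
    return X, y

def fit_stump(X, y, rows, features):
    """Return the (feature, threshold, polarity) with the fewest errors on `rows`."""
    best = None
    for f in features:
        for t in (0, 1):          # split genotypes at >0 or >1
            for pol in (0, 1):    # pol=1 inverts the predicted label
                errs = sum((int(X[i][f] > t) ^ pol) != y[i] for i in rows)
                if best is None or errs < best[0]:
                    best = (errs, f, t, pol)
    return best[1:]

def predict_stump(stump, x):
    f, t, pol = stump
    return int(x[f] > t) ^ pol

def oob_and_test_error(X_tr, y_tr, X_te, y_te, n_trees=50, mtry=2, seed=0):
    """Return (OOB error, held-out test error) for a forest of stumps."""
    rng = random.Random(seed)
    n, p = len(X_tr), len(X_tr[0])
    votes = [[0, 0] for _ in range(n)]  # OOB votes per training sample
    trees = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), min(mtry, p))    # random feature subset
        stump = fit_stump(X_tr, y_tr, rows, feats)
        trees.append(stump)
        in_bag = set(rows)
        for i in range(n):
            if i not in in_bag:                       # sample unseen by this tree
                votes[i][predict_stump(stump, X_tr[i])] += 1
    scored = [(v, y_tr[i]) for i, v in enumerate(votes) if sum(v) > 0]
    oob_err = sum(int(v[1] > v[0]) != lab for v, lab in scored) / max(len(scored), 1)

    def forest_predict(x):
        ones = sum(predict_stump(s, x) for s in trees)
        return int(2 * ones > len(trees))             # majority vote

    test_err = sum(forest_predict(x) != t for x, t in zip(X_te, y_te)) / len(X_te)
    return oob_err, test_err

if __name__ == "__main__":
    X, y = make_toy_data(200, 5, seed=1)
    Xt, yt = make_toy_data(100, 5, seed=2)
    for k in (1, 3, 5):  # grow the number of features offered to the forest
        cols = lambda data: [row[:k] for row in data]
        print(k, oob_and_test_error(cols(X), y, cols(Xt), yt))
```

Running the loop at the bottom prints the two error estimates side by side for a growing number of features, which is the kind of comparison the highlight refers to. How closely they agree on this toy data says nothing about the two data sets in the study.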

Summary

Introduction

The discovery of genetic associations is an important step in understanding human illness and deriving disease pathways. So far, three established genes are involved in early-onset AD (EOAD) and follow autosomal dominant inheritance: APP (β-amyloid precursor protein), PSEN1, and PSEN2 (presenilin-dependent γ-secretase activity cuts amyloid precursor proteins into β-amyloid peptides) [5,6]. Another well-established genetic risk factor is APOE (it encodes a lipoprotein that may interact with accumulated β-amyloid); it manifests in the more common LOAD, and its inheritance does not follow Mendelian principles [7,8]. A person who has one or two copies of ε4 may never develop AD, while another who does not carry the ε4 allele may [8].

