Abstract

We used the simulated data set from Genetic Analysis Workshop 15 Problem 3 to assess a two-stage approach for identifying single-nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). In the first stage, we used random forests (RF) to screen large amounts of genetic data using the variable importance measure, which takes into account SNP interaction effects as well as main effects without requiring model specification. We used the 9187 simulated SNPs mimicking a 10K SNP chip, along with the covariates DR (the simulated DRB1 genotype), smoking, and sex, as input to the RF analyses, with a training set consisting of 750 unrelated RA cases and 750 controls. We used an iterative RF screening procedure to identify a smaller set of variables for further analysis. In the second stage, we used CaMML, a software program for producing Bayesian networks, to develop complex etiologic models for RA risk from the variables identified by our RF screening procedure. We evaluated the performance of this method using independent test data sets for up to 100 replicates.

Highlights

  • It is commonly believed that complex diseases are caused not by single genes acting alone, but by multiple genes and non-genetic factors interacting with one another

  • We identified the best surrogates for all risk loci (A-G) as the single-nucleotide polymorphisms (SNPs) with the highest linkage disequilibrium (LD), measured by r², with the risk loci from the answer files given with the Genetic Analysis Workshop 15 (GAW15) data (Table 1)

  • Risk variables identified by random forests (RF): we compared ITbp and IT0 top 50 for choosing a set of variables by how often the best surrogates for loci A-G appeared in the variable set
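
The surrogate selection and the "how often did the surrogate appear" comparison described in the highlights can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes phased haplotype data coded as 0/1 alleles, uses the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)), and every name and data value (`ld_r2`, `best_surrogate`, `surrogate_hit_rate`, `snp_1`, and so on) is hypothetical.

```python
def ld_r2(a, b):
    """LD r^2 between two loci given equal-length lists of 0/1 haplotype alleles."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele lists must be non-empty and of equal length")
    p_a = sum(a) / n                               # allele frequency at locus A
    p_b = sum(b) / n                               # allele frequency at locus B
    p_ab = sum(x * y for x, y in zip(a, b)) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                           # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom    # monomorphic locus: define r^2 as 0


def best_surrogate(risk_alleles, snp_alleles):
    """Name of the SNP with the highest r^2 with the risk locus."""
    return max(snp_alleles, key=lambda name: ld_r2(risk_alleles, snp_alleles[name]))


def surrogate_hit_rate(surrogates, selected_sets):
    """Fraction of replicates whose selected variable set contains each surrogate."""
    n = len(selected_sets)
    return {s: sum(s in chosen for chosen in selected_sets) / n for s in surrogates}


# Hypothetical toy data: one risk locus and three candidate SNPs on 8 haplotypes.
risk = [1, 1, 0, 0, 1, 0, 1, 0]
snps = {
    "snp_1": [1, 1, 0, 0, 1, 0, 1, 1],
    "snp_2": [1, 0, 0, 0, 1, 0, 1, 1],
    "snp_3": [0, 1, 0, 1, 0, 1, 0, 1],
}
surrogate = best_surrogate(risk, snps)

# Hypothetical variable sets selected in three replicates.
hits = surrogate_hit_rate(["snp_1", "snp_2"],
                          [{"snp_1", "snp_9"}, {"snp_1", "snp_2"}, {"snp_3"}])
```

With unphased genotype data, the squared correlation of allele counts is a commonly used approximation to r²; the sketch above does not cover that case.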

Summary

Introduction

It is commonly believed that complex diseases are caused not by single genes acting alone, but by multiple genes and non-genetic factors interacting with one another. When risk-associated SNPs have small marginal effects but large interaction effects in the population, univariate methods will result in low power for detecting these SNPs. "Multi-locus" approaches consider interactions of multiple genes and environmental factors in identifying susceptibility loci for complex diseases [1]. Random Forests (RFs) [2] provide a powerful method for detecting interacting risk susceptibility SNPs (rSNPs) [3]. This method does not provide a model that delineates the interactions
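
To make the point about weak marginal effects concrete, here is a deliberately extreme toy example, not taken from the paper's simulated data. It assumes a hypothetical balanced population in which disease occurs only when exactly one of two variants is carried. Each SNP then shows no marginal association at all, while the pair predicts disease perfectly. All names and numbers are illustrative.

```python
from itertools import product

# Hypothetical toy population: 250 individuals for each (snp1, snp2) carrier
# combination; disease status is 1 only when exactly one variant is carried.
population = [
    (s1, s2, int(s1 != s2))
    for s1, s2 in product([0, 1], repeat=2)
    for _ in range(250)
]


def risk(rows):
    """Proportion of affected individuals among the given rows."""
    return sum(r[2] for r in rows) / len(rows)


def marginal_risk_difference(rows, snp_index):
    """Risk in carriers minus risk in non-carriers for one SNP, ignoring the other."""
    carriers = [r for r in rows if r[snp_index] == 1]
    non_carriers = [r for r in rows if r[snp_index] == 0]
    return risk(carriers) - risk(non_carriers)


def joint_risk(rows):
    """Risk within each (snp1, snp2) carrier combination."""
    return {
        (s1, s2): risk([r for r in rows if r[0] == s1 and r[1] == s2])
        for s1, s2 in product([0, 1], repeat=2)
    }


marginal = [marginal_risk_difference(population, i) for i in (0, 1)]
joint = joint_risk(population)

print(marginal)  # [0.0, 0.0]: a one-SNP-at-a-time test sees nothing
print(joint)     # {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
```

Real interaction effects are rarely this clean, but the same mechanism explains why methods that consider several variables jointly can detect SNPs that univariate tests miss.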
