Abstract

With the development of high-throughput single-nucleotide polymorphism (SNP) technologies, the vast number of SNPs in comparatively small samples poses a challenge to the application of classical statistical procedures. A possible solution is to use a two-stage approach for case-control data in which, in the first stage, a screening test selects a small number of SNPs for further analysis. The second stage then estimates the effects of the selected variables using logistic regression (logReg). Here, we introduce a novel approach in which the selection of SNPs is based on the permutation importance estimated by random forests (RFs). For this, we used the simulated data provided for the Genetic Analysis Workshop 15 without knowledge of the true model. The data set was randomly split into a first and a second data set. In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg. The highest effect estimates were obtained for five simulated loci. We detected smoking, gender, and the parental DR alleles as covariates. After correction for multiple testing, we identified two out of four genes simulated with a direct effect on rheumatoid arthritis risk and all covariates without any false positive. We showed that a two-stage approach with a screening of SNPs by RFs is suitable to detect candidate SNPs in genome-wide association studies for complex diseases.

Highlights

  • We showed that a two-stage approach with a screening of single-nucleotide polymorphisms (SNPs) by random forests (RFs) is suitable to detect candidate SNPs in genome-wide association studies for complex diseases

  • To identify genetic polymorphisms predisposing for a complex disease, genome-wide association studies have become more promising with the advances in technological possibilities

  • The application of this approach is demonstrated by analyzing the simulated genome-wide scan for rheumatoid arthritis (RA), which was provided for the Genetic Analysis Workshop (GAW) 15, without knowledge of the true model


Summary

Introduction

Advances in genotyping technology have made genome-wide association studies increasingly promising for identifying genetic polymorphisms that predispose to complex disease. However, the vast number of variables with uncertain dependency structures in comparatively small samples makes the application of classical statistical procedures difficult. A possible solution is a two-stage approach. The first stage selects a small number of SNPs for further analysis. The second stage uses an independent sample to validate these findings by estimating the effects of the selected variables using logistic regression (logReg). The application of this approach is demonstrated by analyzing the simulated genome-wide scan for rheumatoid arthritis (RA), which was provided for the Genetic Analysis Workshop (GAW) 15, without knowledge of the true model.
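The two-stage design can be illustrated with a small, self-contained sketch. It is not the authors' implementation and makes several assumptions that are not in the source: the data are toy genotypes simulated here (20 independent SNPs coded 0/1/2, one hypothetical causal SNP at index 4), not the GAW 15 data; the forest is a stand-in built from bootstrapped one-split trees (stumps) rather than fully grown trees; haplotype tagging is omitted; and the second stage fits one single-SNP logistic regression per selected variable with a Wald test and a Bonferroni threshold, which is only one possible choice of multiple-testing correction. All function names, the number of selected SNPs (k = 3), and the seeds are hypothetical.

```python
import math
import random

def predict(stump, genotype):
    """Stump rule: call a case (1) when genotype > threshold; polarity 1 flips the call."""
    _, threshold, polarity = stump
    return int(genotype > threshold) ^ polarity

def best_stump(X, y, rows, features):
    """Pick the (feature, threshold, polarity) rule with the most correct calls on `rows`."""
    best, best_hits = None, -1
    for f in features:
        for threshold in (0, 1):              # genotypes are coded 0/1/2
            for polarity in (0, 1):
                stump = (f, threshold, polarity)
                hits = sum(predict(stump, X[i][f]) == y[i] for i in rows)
                if hits > best_hits:
                    best, best_hits = stump, hits
    return best

def permutation_importance(X, y, n_trees=200, mtry=3, seed=0):
    """Mean drop in out-of-bag accuracy after shuffling each tree's split variable."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # out-of-bag rows
        if not oob:
            continue
        stump = best_stump(X, y, boot, rng.sample(range(p), mtry))
        f = stump[0]
        values = [X[i][f] for i in oob]
        shuffled = values[:]
        rng.shuffle(shuffled)                            # break the SNP-outcome link
        hits = sum(predict(stump, v) == y[i] for v, i in zip(values, oob))
        hits_perm = sum(predict(stump, v) == y[i] for v, i in zip(shuffled, oob))
        importance[f] += (hits - hits_perm) / len(oob) / n_trees
    return importance

def logreg_wald(x, y, iters=25):
    """Single-predictor logistic regression by Newton-Raphson; returns (slope, Wald p-value)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            prob = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = prob * (1.0 - prob)
            g0 += yi - prob
            g1 += (yi - prob) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se = math.sqrt(h00 / det)                 # from the information matrix of the last step
    return b1, math.erfc(abs(b1 / se) / math.sqrt(2))

def simulate(n=1000, p=20, causal=4, beta=1.2, seed=1):
    """Toy data: independent SNPs (allele frequency 0.3) and one causal SNP."""
    rng = random.Random(seed)
    X = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(p)] for _ in range(n)]
    y = [int(rng.random() < 1.0 / (1.0 + math.exp(0.8 - beta * row[causal]))) for row in X]
    return X, y

X, y = simulate()
idx = list(range(len(X)))
random.Random(2).shuffle(idx)                 # random split into two data sets
half = len(idx) // 2
stage1, stage2 = idx[:half], idx[half:]

# Stage 1: rank SNPs by permutation importance and keep the top k.
imp = permutation_importance([X[i] for i in stage1], [y[i] for i in stage1])
k = 3
selected = sorted(range(len(imp)), key=lambda j: imp[j], reverse=True)[:k]

# Stage 2: estimate effects in the independent half, Bonferroni-corrected over k tests.
alpha = 0.05
results = {}
for j in selected:
    slope, p_value = logreg_wald([X[i][j] for i in stage2], [y[i] for i in stage2])
    results[j] = (slope, p_value, p_value < alpha / k)
```

Because only k variables reach the second stage, the correction for multiple testing is applied over k tests rather than over all SNPs, and because the second data set played no part in the selection, the effect estimates are not biased by the screening step. In practice a full random forest implementation would replace the stump ensemble used here.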

