Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

Thanh-Tung Nguyen, Thuy Thi Nguyen, Qingyao Wu, Mark Junjie Li, Joshua Zhexue Huang

doi:10.1186/1471-2164-16-s2-s5

Abstract

Background: Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been successfully used in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS when selecting informative SNPs and building accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building the trees of the forest, only SNPs from these two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node in a tree.

Results: This approach enables one to generate more accurate trees with a lower prediction error while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing state-of-the-art random forests. The top 25 SNPs identified by the proposed model in the Parkinson data set include four interesting genes associated with neurological disorders.

Conclusion: The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail to detect. The new RF works well for data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrated the effectiveness of the proposed RF model, which outperformed the state-of-the-art RFs, including Breiman's RF, GRRF and wsRF methods.
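The two-stage subspace selection described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the p-values are computed or how the informative group is divided, so a per-SNP Pearson chi-square test and a median split on the test statistic are used here as stand-ins. The function names (`chi2_stat_and_p`, `split_snps`, `sample_subspace`), the `alpha` threshold and the proportional sampling rule are all hypothetical.

```python
import math
import random
from collections import Counter


def chi2_stat_and_p(genotypes, labels):
    """Pearson chi-square test of one SNP (coded 0/1/2) against a binary label.

    Stand-in for the paper's p-value assessment (an assumption of this sketch).
    """
    n = len(labels)
    rows = Counter(genotypes)                # genotype totals
    cols = Counter(labels)                   # case/control totals
    cells = Counter(zip(genotypes, labels))  # observed counts per cell
    stat = 0.0
    for g, row_total in rows.items():
        for c, col_total in cols.items():
            expected = row_total * col_total / n
            stat += (cells[(g, c)] - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    if df == 0:                              # constant SNP or single class
        return 0.0, 1.0
    if df == 1:                              # chi-square survival function, 1 df
        return stat, math.erfc(math.sqrt(stat / 2))
    if df == 2:                              # chi-square survival function, 2 df
        return stat, math.exp(-stat / 2)
    raise ValueError("expected at most 3 genotypes and 2 classes")


def split_snps(X, y, alpha=0.05):
    """Stage 1: keep SNPs with p < alpha. Stage 2: split them at the median statistic."""
    scored = []
    for j in range(len(X[0])):
        stat, p = chi2_stat_and_p([row[j] for row in X], y)
        if p < alpha:
            scored.append((stat, j))
    scored.sort(reverse=True)                # strongest association first
    half = (len(scored) + 1) // 2
    strong = [j for _, j in scored[:half]]   # "highly informative" (assumed rule)
    weak = [j for _, j in scored[half:]]     # "weakly informative" (assumed rule)
    return strong, weak


def sample_subspace(strong, weak, mtry, rng):
    """Draw up to mtry SNP indices from both sub-groups, always including a strong SNP."""
    if not strong:
        return []
    share = mtry * len(strong) / (len(strong) + len(weak))
    n_strong = max(1, min(len(strong), round(share)))
    n_weak = max(0, min(len(weak), mtry - n_strong))
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


# Toy data: 20 cases (1) and 20 controls (0), four SNPs per sample.
y = [1] * 20 + [0] * 20
snp0 = [2] * 20 + [0] * 20                       # perfectly associated
snp1 = [2] * 15 + [0] * 5 + [2] * 5 + [0] * 15   # moderately associated
snp2 = [0, 1] * 20                               # unrelated to the label
snp3 = [1] * 40                                  # constant, carries no signal
X = [list(row) for row in zip(snp0, snp1, snp2, snp3)]

strong, weak = split_snps(X, y)
subspace = sample_subspace(strong, weak, mtry=2, rng=random.Random(0))
```

On this toy data, SNPs 2 and 3 fall on the irrelevant side of the cut-off, so a tree-building routine would only ever see SNPs 0 and 1 when choosing a split. In a full forest, `sample_subspace` would be called at each node in place of uniform sampling over all SNPs.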

Highlights

Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis
We propose a new approach to learning a Random Forests (RF) model using a two-stage quality-based SNP subspace selection method, which is tailored to the high-dimensional data of GWA studies
The ts-RF and wsRF models were implemented as multi-threaded processes, while the other models were run as single-threaded processes

Summary

Introduction

Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. Advanced machine learning methods have been successfully used in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. We propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. With genome-wide genotyping of SNPs in the human genome, it is possible to evaluate disease-associated SNPs to help unravel the genetic basis of complex genetic diseases [1]. SNPs are single nucleotide variations of DNA base pairs, and it is well established in the GWAS field that SNP profiles characterize a variety of diseases [2]. The task is to identify the genetic susceptibility of SNPs by assaying and analyzing SNPs at the genome-wide scale [3].



Journal: BMC Genomics	Publication Date: Jan 21, 2015
Citations: 101	License type: cc-by


Similar Papers

SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests
Qingyao Wu ... Yang Liu
IEEE Transactions on NanoBioscience | VOL. 11
01 Sep 2012

Stratified Random Forest for Genome-wide Association Study
Qingyao Wu ... Michael Ng
01 Nov 2011

LDL-cholesterol concentrations: a genome-wide association study
...
The Lancet | VOL. 371
01 Feb 2008

Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data
Chiyong Kang ... Hyeji Yu
BMC Medical Informatics and Decision Making | VOL. 13
01 Apr 2013
