Abstract

BackgroundRanking and identifying biomarkers that are associated with disease from genome-wide measurements holds significant promise for understanding the genetic basis of common diseases. The large number of single nucleotide polymorphisms (SNPs) in genome-wide studies (GWAS), however, makes this task computationally challenging when the ranking is to be done in a multivariate fashion. This paper evaluates the performance of a multivariate graph-based method called label propagation (LP) that efficiently ranks SNPs in genome-wide data.ResultsThe performance of LP was evaluated on a synthetic dataset and two late onset Alzheimer’s disease (LOAD) genome-wide datasets, and the performance was compared to that of three control methods. The control methods included chi squared, which is a commonly used univariate method, as well as a Relief method called SWRF and a sparse logistic regression (SLR) method, which are both multivariate ranking methods. Performance was measured by evaluating the top-ranked SNPs in terms of classification performance, reproducibility between the two datasets, and prior evidence of being associated with LOAD.On the synthetic data LP performed comparably to the control methods. On GWAS data, LP performed significantly better than chi squared and SWRF in classification performance in the range from 10 to 1000 top-ranked SNPs for both datasets, and not significantly different from SLR. LP also had greater ranking reproducibility than chi squared, SWRF, and SLR. Among the 25 top-ranked SNPs that were identified by LP, there were 14 SNPs in one dataset that had evidence in the literature of being associated with LOAD, and 10 SNPs in the other, which was higher than for the other methods.ConclusionLP performed considerably better in ranking SNPs in two high-dimensional genome-wide datasets when compared to three control methods. It had better performance in the evaluation measures we used, and is computationally efficient to be applied practically to data from genome-wide studies. These results provide support for including LP in the methods that are used to rank SNPs in genome-wide datasets.

Highlights

  • Ranking and identifying biomarkers that are associated with disease from genome-wide measurements holds significant promise for understanding the genetic basis of common diseases

  • In addition to Single nucleotide polymorphism (SNP) directly named in the literature as having an association with late onset Alzheimer’s disease (LOAD), we considered a wider degree of plausible associations

  • Biomarker discovery in Genome-wide association study (GWAS) data is a challenging problem with the potential for many false positives and the lack of reproducibility across datasets

Read more

Summary

Introduction

Ranking and identifying biomarkers that are associated with disease from genome-wide measurements holds significant promise for understanding the genetic basis of common diseases. The large number of single nucleotide polymorphisms (SNPs) in genome-wide studies (GWAS), makes this task computationally challenging when the ranking is to be done in a multivariate fashion. This paper evaluates the performance of a multivariate graph-based method called label propagation (LP) that efficiently ranks SNPs in genome-wide data. The typical analysis of GWAS data involves the application of a univariate feature ranking method that evaluates each SNP’s strength of association with disease independently of all other SNPs. For example, the chi squared statistic is used to assess the expected and observed genotypes of a SNP in cases and controls in a GWAS, and ranks. Multivariate feature ranking methods evaluate each SNP’s strength of association with disease in the context of other SNPs. For example, Relief is capable of detecting complex SNP-SNP dependencies even in the absence of main effects.

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call