Abstract

Background: With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of single nucleotide polymorphisms (SNPs) for disease susceptibility prediction is a challenging task. This study aimed to search GWAS data for a set of SNPs that predicts psoriasis.

Methods: In total, the dataset comprised 2,798 samples and 451,724 SNPs. The search for a set of SNPs to predict susceptibility to psoriasis consisted of two steps. The first step was to find the top 1,000 SNPs with high accuracy for predicting psoriasis in the GWAS dataset. The second step was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance.

Results: The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674 (95% CI: 0.650-0.698), while only 0.520 (95% CI: 0.472-0.524) was reported for predicting disease by LDA. Our results indicate that the new classifier sIB performs better than LDA in this study.

Conclusions: The fact that a small set of SNPs can predict disease status with an average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
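The performance measure reported above, the harmonic mean of sensitivity and specificity, can be written as 2 x sensitivity x specificity / (sensitivity + specificity). The following is a minimal sketch of that computation, not the authors' code; the function names `hmss` and `hmss_from_labels` and the convention that cases are labelled `1` are assumptions made for illustration.

```python
def hmss(sensitivity: float, specificity: float) -> float:
    """Harmonic mean of sensitivity and specificity (0.0 if both are zero)."""
    total = sensitivity + specificity
    if total == 0:
        return 0.0
    return 2.0 * sensitivity * specificity / total


def hmss_from_labels(y_true, y_pred, positive=1):
    """Compute the harmonic mean from true and predicted binary labels.

    `positive` marks the case (disease) label; this encoding is an assumption.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return hmss(sensitivity, specificity)
```

Unlike plain accuracy, this measure drops to zero for a classifier that assigns every sample to one class, because either sensitivity or specificity is then zero.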

Highlights

  • With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of single nucleotide polymorphisms (SNPs) for disease susceptibility prediction is a challenging task

  • We used linear discriminant analysis (LDA) as a classifier and harmonic mean of sensitivity and specificity (HMSS) evaluated in the cross-validation data set as a criterion to select an optimal subset of SNPs for prediction by forward selection (FS) or sequential forward floating selection (SFFS) algorithms

  • To determine how many SNPs should be used for prediction, we plotted Figure 1 showing cross-validation HMSS and test HMSS versus the number of markers searched through two feature selection algorithms
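The forward selection (FS) search mentioned in the highlights can be sketched as a greedy loop: start from an empty subset and repeatedly add the marker that most improves a score. The sketch below is an illustration under stated assumptions, not the authors' implementation. In the study the score would be the cross-validation HMSS of an LDA classifier; here `toy_score` is a hypothetical stand-in with made-up gains so that the example runs without any data. The floating variant (SFFS), which can also remove markers, is not shown.

```python
from typing import Callable, List, Tuple


def forward_selection(
    n_features: int,
    score: Callable[[List[int]], float],
    max_size: int,
) -> List[Tuple[List[int], float]]:
    """Greedy forward selection.

    Returns one (subset, score) pair per subset size, which is the kind of
    curve that can be plotted against the number of markers.
    """
    selected: List[int] = []
    remaining = set(range(n_features))
    history: List[Tuple[List[int], float]] = []
    while remaining and len(selected) < max_size:
        best_feature, best_score = None, float("-inf")
        for f in sorted(remaining):
            s = score(selected + [f])
            if s > best_score:
                best_feature, best_score = f, s
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((list(selected), best_score))
    return history


def toy_score(subset: List[int]) -> float:
    """Hypothetical scoring function (NOT cross-validated LDA).

    Markers 0 and 1 are individually useful but partly redundant,
    marker 2 is weakly useful, marker 3 is uninformative.
    """
    gains = {0: 0.30, 1: 0.25, 2: 0.10}
    total = sum(gains.get(f, 0.0) for f in subset)
    if 0 in subset and 1 in subset:
        total -= 0.20  # redundancy penalty, an invented value
    return total


if __name__ == "__main__":
    for subset, value in forward_selection(4, toy_score, max_size=3):
        print(len(subset), subset, round(value, 2))
```

With this toy score the search adds marker 2 before marker 1, even though marker 1 is the stronger marker on its own, because the score is evaluated on the whole subset rather than on markers one at a time.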


Summary

Introduction

With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. In the studies by Brinza et al. [9] and Wang et al. [10], the authors used the cross-validation accuracy of the optimized subset as the measure of the classifier's final performance. This estimate might be biased, because the cross-validation accuracy of the best subset is itself maximized by the search algorithm wrapped around the classifier. In another genome-wide study [12], an independent data set was set aside, but its power was limited because the training and test samples were too small. Both the training dataset and the independent test dataset numbered more than 1,000 samples.
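The bias described above can be illustrated with a small simulation. This is a simplified sketch, not a reproduction of any cited study: it picks, from many pure-noise binary markers, the one that best matches the labels on a small selection set, then scores the same marker on fresh samples. The function name `selection_bias_demo`, the sample sizes, and the seed are all assumptions chosen for illustration, and a single best marker stands in for a full subset search with cross-validation.

```python
import random


def accuracy(feature, labels):
    """Fraction of samples where the binary marker equals the label."""
    return sum(1 for x, y in zip(feature, labels) if x == y) / len(labels)


def selection_bias_demo(n_features=200, n_select=40, n_test=1000, seed=0):
    """Return (score on the data used for selection, score on held-out data)."""
    rng = random.Random(seed)
    n_total = n_select + n_test
    labels = [rng.randint(0, 1) for _ in range(n_total)]
    # Pure-noise markers: none of them carries information about the labels.
    features = [
        [rng.randint(0, 1) for _ in range(n_total)] for _ in range(n_features)
    ]
    select_labels = labels[:n_select]
    test_labels = labels[n_select:]
    # Choose the marker that maximizes the score on the selection set.
    best = max(
        range(n_features),
        key=lambda j: accuracy(features[j][:n_select], select_labels),
    )
    selected_score = accuracy(features[best][:n_select], select_labels)
    held_out_score = accuracy(features[best][n_select:], test_labels)
    return selected_score, held_out_score


if __name__ == "__main__":
    print(selection_bias_demo())
```

Because the winning marker is chosen for its score on the selection set, that score is optimistic, while the held-out score stays near chance. This is the reason an independent test set of adequate size is needed to report final performance.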

