Abstract
Background: With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of single nucleotide polymorphisms (SNPs) for disease susceptibility prediction is a challenging task. This study aimed to identify, by searching GWAS data, a set of SNPs that predicts psoriasis.
Methods: The data set comprised 2,798 samples and 451,724 SNPs. The search for a set of SNPs to predict susceptibility to psoriasis consisted of two steps. The first step selected the top 1,000 SNPs with the highest accuracy for predicting psoriasis from the GWAS data set. The second step searched for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance.
Results: The best test harmonic mean of sensitivity and specificity for predicting psoriasis was 0.674 (95% CI: 0.650-0.698) with sIB, compared with only 0.520 (95% CI: 0.472-0.524) with LDA. These results indicate that the sIB classifier performed better than LDA in this study.
Conclusions: The fact that a small set of SNPs can predict disease status with an average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
Highlights
With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of single nucleotide polymorphisms (SNPs) for disease susceptibility prediction is a challenging task
We used linear discriminant analysis (LDA) as a classifier and harmonic mean of sensitivity and specificity (HMSS) evaluated in the cross-validation data set as a criterion to select an optimal subset of SNPs for prediction by forward selection (FS) or sequential forward floating selection (SFFS) algorithms
To determine how many SNPs should be used for prediction, we plotted Figure 1 showing cross-validation HMSS and test HMSS versus the number of markers searched through two feature selection algorithms
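The selection criterion and the simpler of the two search strategies named in the highlights can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `hmss` and `forward_selection` and the `score_subset` callback are hypothetical. In the study, the score would be the cross-validation HMSS of an LDA classifier trained on the candidate SNP subset; here it is left as a pluggable function. Only plain forward selection is shown; the floating (backward) step of SFFS is omitted.

```python
from typing import Callable, List, Sequence, Tuple


def hmss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of sensitivity and specificity (1 = case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    total = sens + spec
    return 2.0 * sens * spec / total if total else 0.0


def forward_selection(
    n_features: int,
    score_subset: Callable[[List[int]], float],  # hypothetical scorer, e.g. CV HMSS
    max_size: int,
) -> List[Tuple[List[int], float]]:
    """Greedy forward selection: add the feature that most improves the score.

    Returns the selected subset and its score after each step, so the score
    can be plotted against the number of markers.
    """
    selected: List[int] = []
    remaining = set(range(n_features))
    history: List[Tuple[List[int], float]] = []
    while remaining and len(selected) < max_size:
        best_feature, best_score = -1, float("-inf")
        for f in sorted(remaining):  # sorted for deterministic tie-breaking
            s = score_subset(selected + [f])
            if s > best_score:
                best_feature, best_score = f, s
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((list(selected), best_score))
    return history
```

Because the score is maximized during the search, the final subset should be judged on data that played no part in the selection, which is why the highlights distinguish cross-validation HMSS from test HMSS.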
Summary
With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. In the studies by Brinza et al. [9] and Wang et al. [10], the authors used the cross-validation accuracy of the optimized subset as a measure of the final performance of the classifier. This measure might be biased because the cross-validation accuracy of the best subset is maximized within a search algorithm wrapped around a classifier. In another genome-wide study [12], an independent data set was held out, but the power was limited because the training and test data sets were too small. In this study, both the training data set and the independent test data set contained more than 1,000 samples.