Abstract

BackgroundA great success of the genome wide association study enabled us to give more attention on the personal genome and clinical application such as diagnosis and disease risk prediction. However, previous prediction studies using known disease associated loci have not been successful (Area Under Curve 0.55 ~ 0.68 for type 2 diabetes and coronary heart disease). There are several reasons for poor predictability such as small number of known disease-associated loci, simple analysis not considering complexity in phenotype, and a limited number of features used for prediction.MethodsIn this research, we investigated the effect of feature selection and prediction algorithm on the performance of prediction method thoroughly. In particular, we considered the following feature selection and prediction methods: regression analysis, regularized regression analysis, linear discriminant analysis, non-linear support vector machine, and random forest. For these methods, we studied the effects of feature selection and the number of features on prediction. Our investigation was based on the analysis of 8,842 Korean individuals genotyped by Affymetrix SNP array 5.0, for predicting smoking behaviors.ResultsTo observe the effect of feature selection methods on prediction performance, selected features were used for prediction and area under the curve score was measured. For feature selection, the performances of support vector machine (SVM) and elastic-net (EN) showed better results than those of linear discriminant analysis (LDA), random forest (RF) and simple logistic regression (LR) methods. For prediction, SVM showed the best performance based on area under the curve score. With less than 100 SNPs, EN was the best prediction method while SVM was the best if over 400 SNPs were used for the prediction.ConclusionsBased on combination of feature selection and prediction methods, SVM showed the best performance in feature selection and prediction.

Highlights

  • A great success of the genome wide association study enabled us to give more attention on the personal genome and clinical application such as diagnosis and disease risk prediction

  • We defined three dichotomous traits such as smoking initiation ("never smoked defined as controls” vs. “former, occasional, or habitual smoker defined as cases”), CPD10 ("light smoking as controls with cigarettes smoked per day (CPD) ≤10” vs. “heavy smokers with CPD > 20 defined as cases”), and smoking cessation (SC) ("former smoker defined as controls” vs. “current smoker defined as cases”)

  • We compared the performance of the prediction methods such as logistic regression (LR), linear discriminant analysis (LDA), support vector machine (SVM), elastic net (EN) and random forest (RF)

Read more

Summary

Introduction

A great success of the genome wide association study enabled us to give more attention on the personal genome and clinical application such as diagnosis and disease risk prediction. There are several reasons for poor predictability such as small number of known disease-associated loci, simple analysis not considering complexity in phenotype, and a limited number of features used for prediction. The main goal of genome-wide association studies (GWAS) is to identify the complex phenotype associated loci. A great success of the GWAS leads us to move our focus on the application to personal genomics and clinical practice such as diagnosis, disease risk prediction and prevention. Personal genome data will be the baseline information for personal medical treatment and scientific research as the number of GWAS and genomics studies are growing. GWAS using personal genome data successfully have unveiled novel loci for common traits and Parkinson’s disease [2,3]

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call