Abstract

Smoking is a complex behavior with a heritability as high as 50%. Such a large genetic contribution offers an opportunity to keep individuals who are susceptible to smoking dependence from ever starting to smoke, by predicting their inherited predisposition from their genomic profiles. Although previous studies have identified many susceptibility variants for smoking, these variants have limited power to predict smoking behavior. We applied the support vector machine (SVM) and random forest (RF) methods to build prediction models for smoking behavior. We first used 1,431 smokers and 1,503 non-smokers of African origin for model building with 10-fold cross-validation and then tested the prediction models on an independent dataset of 213 smokers and 224 non-smokers. The SVM model with the top 500 single nucleotide polymorphisms (SNPs), selected using logistic regression (p < 0.01) as the feature selection method, achieved areas under the curve (AUCs) of 0.691, 0.721, and 0.720 for the training, test, and independent test samples, respectively. The RF model with the top 500 SNPs, selected in the same way, achieved AUCs of 0.671, 0.665, and 0.667 for the training, test, and independent test samples, respectively. Finally, we combined logistic regression (p < 0.01) and LASSO regression (λ = 10⁻³) to select features and used the SVM algorithm for model building. The resulting SVM model with the top 500 SNPs achieved AUCs of 0.756, 0.776, and 0.897 for the training, test, and independent test samples, respectively. We conclude that machine learning methods are a promising means of building predictive models for smoking.
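The AUC values reported above can be read as the probability that a randomly chosen smoker receives a higher model score than a randomly chosen non-smoker. The following is a minimal sketch of that rank-based computation in plain Python. It is an illustration only, not the authors' code; the function name `auc` and the 1/0 label coding (1 for smoker, 0 for non-smoker) are assumptions made for this example.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: model outputs, higher meaning "more likely a smoker".
    labels: 1 for case (smoker), 0 for control (non-smoker); this coding
            is an assumption of the sketch.
    Tied case/control pairs count as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    # Fraction of case/control pairs that the model ranks correctly.
    return wins / (len(cases) * len(controls))


# Two smokers and two non-smokers; three of the four pairs are ranked correctly.
example = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in sample size, which is fine for illustration but would usually be replaced by a sort-based version on larger datasets.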

Highlights

  • Tobacco smoking is one of the most important public health problems throughout the world [1]

  • After completing all machine learning processes, we found that the support vector machine (SVM) model with the combined feature selection approach, using both logistic regression and least absolute shrinkage and selection operator (LASSO) regression, appeared to perform better than the models using only one method on both the test and independent test samples, regardless of the number of single nucleotide polymorphisms (SNPs) included in each model (Table 4)

  • Given the results obtained from this series of parameter selections and machine learning methods, we concluded that the SVM model with the combined logistic regression (P < 0.01) and LASSO regression (λ = 10⁻³) as the feature selection method represented the best approach for developing our prediction model for the datasets used in this study
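The logistic regression screening step named in the highlights (keep SNPs with P < 0.01, then take the top-ranked ones) can be sketched roughly as follows. This is a hedged illustration in plain Python, not the authors' pipeline: the function names `logistic_wald_p` and `select_snps`, the per-SNP univariate model, the Wald test, and the numeric genotype coding are all assumptions of this example, and the subsequent LASSO and SVM steps are not shown.

```python
import math


def logistic_wald_p(x, y, iters=25):
    """Two-sided Wald p-value for b1 in logit P(y=1) = b0 + b1 * x.

    x: numeric genotype codes for one SNP (coding is an assumption).
    y: 1 for smoker, 0 for non-smoker.
    Fitted by Newton-Raphson; returns 1.0 when the fit is degenerate
    (e.g. a monomorphic SNP), which is a conservative choice of this sketch.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            eta = max(-30.0, min(30.0, b0 + b1 * xi))  # avoid exp overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += yi - p               # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                   # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            return 1.0
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    se = math.sqrt(h00 / det)          # standard error of b1
    z = b1 / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_snps(genotypes, y, alpha=0.01, top_k=500):
    """Return up to top_k SNP names with p < alpha, smallest p first."""
    pvals = {name: logistic_wald_p(x, y) for name, x in genotypes.items()}
    kept = [name for name, p in pvals.items() if p < alpha]
    return sorted(kept, key=lambda name: pvals[name])[:top_k]


# Toy data: 100 smokers followed by 100 non-smokers (hypothetical values).
y = [1] * 100 + [0] * 100
genotypes = {
    "snp_assoc": [1] * 70 + [0] * 30 + [1] * 30 + [0] * 70,
    "snp_null": [1] * 50 + [0] * 50 + [1] * 50 + [0] * 50,
    "snp_mono": [0] * 200,
}
selected = select_snps(genotypes, y)
```

In this toy example only the SNP whose carrier frequency differs between smokers and non-smokers passes the threshold. In a combined approach such as the one described above, the retained SNPs would then be passed to a LASSO-penalized model for further selection before the SVM is trained.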


Summary

Introduction

Tobacco smoking is one of the most important public health problems throughout the world [1]. According to a World Health Organization report, the number of deaths caused by tobacco smoking will reach 10 million worldwide annually by 2020 [2]. Without significant efforts to limit tobacco smoking, a separate projection puts the annual toll at 8.3 million by 2030 [3]. Prevention of smoking initiation has become a critical step in tobacco control [4,5,6,7]. Stopping individuals who are susceptible to nicotine dependence from starting to smoke is an effective way to achieve tobacco control. Tobacco smoking is a complex, multifactorial behavior determined by both genetic and environmental factors, as well as by gene-by-gene and gene-by-environment interactions [8, 9]. It is therefore feasible to predict an individual's inherited predisposition to smoking on the basis of the genomic profile.
