Abstract
Feature selection is essentially the process of picking informative and relevant features from a larger collection of features. Few studies have focused on predictors for current e-cigarette use among U.S. adults using feature selection and machine learning (ML) approaches. This study aimed to perform feature selection and develop ML approaches in prediction of current e-cigarette use using the 2022 Health Information National Trends Survey (HINTS 6). The Boruta algorithm and the least absolute shrinkage and selection operator (LASSO) were used to perform feature selection of 71 variables. The random oversampling example (ROSE) method was utilized to deal with imbalance data. Five ML tools including support vector machines (SVMs), logistic regression (LR), random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost) were applied to develop ML models. The overall prevalence of current e-cigarette use was 4.3%. Using the overlapped 15 variables selected by Boruta and LASSO, the RF algorithm provided the best classifier with an accuracy of 0.992, sensitivity of 0.985, F1 score of 0.991, and AUC of 0.999. Weighted logistic regression further confirmed that age, education level, smoking status, belief in the harm of e-cigarette use, binge drinking, belief in alcohol increasing cancer, and the Patient Health Questionnaire-4 (PHQ4) score were associated with e-cigarette use. This study confirmed the strength of ML techniques in survey data, and the findings will guide inquiry into behaviors and mentalities of substance users.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have