Abstract Introduction: Among the Hispanic/Latinx population in the US, four of the five leading causes of death are attributed to smoking. Few evidence-based interventions have been developed to improve smoking cessation for this population. Our team has demonstrated the efficacy of a culturally targeted, extended self-help intervention for Spanish-speaking smokers, with an estimated smoking abstinence rate of 33% at 24 months, compared with 24% for usual care. Further efforts are needed to enhance this efficacious, low-cost, scalable intervention approach. Machine learning (ML) is one approach to increase understanding of the predictors of treatment outcomes and to inform strategies to improve the intervention. Toward that goal, this secondary data analysis used a decision tree (DT) model to predict self-reported 7-day point prevalence abstinence at the 18-month follow-up among participants receiving our intervention. Method: Data from participants who reported smoking status at the 18-month follow-up (N=332 of 714 enrolled; 36% abstinent) were randomly split (80:20) into training and test datasets based on smoking status. We entered demographic, psychosocial, and smoking-related variables collected at baseline as predictors in the model. In addition to handling missing values and normalizing numeric features, we applied the Synthetic Minority Over-sampling Technique (SMOTE) combined with Tomek links (SMOTE-Tomek) to improve inter-class separability. Recursive Feature Elimination (RFE) with a DT estimator was used to identify the most relevant predictors. The cross-validation (CV) pipeline incorporated preprocessing, feature selection, handling of class imbalance, over-sampling, and model training with a DT as the interpretable classifier. Hyperparameters were tuned by grid search with stratified K-fold (K=5) CV. The F1 score was selected as the primary metric because it accounts for performance on each class by combining precision and recall.
Finally, the performance of the classifier was evaluated on the held-out 20% test dataset. Results: In CV, the DT classifier achieved an F1 score of 0.73 [95% CI 0.66, 0.77], indicating reasonably good performance in identifying smokers at 18 months. RFE selected familism, age, affect, and confidence as the most relevant predictors of smoking status. Individuals with lower familism scores (<138) and older age (>50) were more likely to be smokers at 18 months. Individuals with higher familism scores (>138) and higher negative affect (>42) were also more likely to be smokers. Conclusion: This study provides a first step toward personalized care for smoking cessation among the Hispanic/Latinx population and demonstrates the potential of a DT classifier to predict cessation outcomes among individuals who received and completed a culturally and linguistically targeted smoking cessation intervention. Our future analyses will compare this model with conventional statistical modeling approaches and other ML algorithms to identify the best-performing parsimonious model. Citation Format: Ranjita Poudel, K. Ruwani M. Fernando, Matthew B. Schabath, Steven K. Sutton, Thomas H. Brandon, Issam El Naqa, Vani N. Simmons. A machine learning approach to predicting smoking cessation outcomes among Spanish-speaking smokers who completed a culturally targeted intervention [abstract]. In: Proceedings of the 17th AACR Conference on the Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved; 2024 Sep 21-24; Los Angeles, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2024;33(9 Suppl):Abstract nr B019.
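The two decision paths reported in the Results can be written out as explicit rules. This is a sketch only: the function and constant names are hypothetical, the cut-offs are the ones quoted in the abstract, and every branch the abstract does not describe (including scores exactly at a cut-off) is returned as "not reported" rather than guessed. It is not the fitted tree.

```python
from typing import Optional

# Cut-offs as quoted in the Results; measurement scales are not given there.
FAMILISM_CUTOFF = 138
AGE_CUTOFF = 50
NEGATIVE_AFFECT_CUTOFF = 42


def reported_smoking_path(familism: float, age: float,
                          negative_affect: float) -> Optional[str]:
    """Return the reported path linked to smoking at 18 months, if any.

    Returns None when the inputs fall on a branch the abstract does not
    describe, so that no prediction is invented for unreported branches.
    """
    if familism < FAMILISM_CUTOFF and age > AGE_CUTOFF:
        return "lower familism and older age"
    if familism > FAMILISM_CUTOFF and negative_affect > NEGATIVE_AFFECT_CUTOFF:
        return "higher familism and higher negative affect"
    return None


# Hypothetical participants, for illustration only.
print(reported_smoking_path(familism=120, age=58, negative_affect=30))
print(reported_smoking_path(familism=150, age=35, negative_affect=47))
print(reported_smoking_path(familism=150, age=35, negative_affect=20))
```

Writing the paths this way makes the interaction visible: age matters on the lower-familism side of the first split, while negative affect matters on the higher-familism side.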