Abstract

Background and aims: The National Survey on Drug Use and Health (NSDUH) contains a large number of responses and many features. This study aims to identify the NSDUH features that are most important in classifying heroin use. Heroin use is highly imbalanced among respondents; properly implemented random forest (RF) techniques can cope with this imbalance and identify features that are prominent in classification models involving nonlinear combinations of predictive variables. To date, methods for the proper application of RF to imbalanced medical datasets have not been defined.

Methods: Three different RF classification techniques are applied to the 2016 NSDUH. The techniques are compared using scoring criteria, including area under the precision-recall curve (AUPRC), to identify the best model. Variable importance scores (VIS) are checked for stability across the three models, and the VIS from the best model are used to highlight the features and categories of features that most influence the classification of heroin users.

Findings: The best-performing method was RF with random oversampling (AUPRC = 0.5437). The category of features describing other drug use was the most important (average z-scored VIS = 1.66), followed by age-of-first-use features (0.32). The most important individual feature was cocaine usage (z-scored VIS = 11.05), followed by crack usage (6.51). The most important individual feature other than the specific drug use flags was the use of marijuana under the age of 18 (3.11).

Conclusions: This study demonstrates a method for using RF in feature extraction from imbalanced medical datasets with many predictors.
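The abstract names three building blocks without showing them: random oversampling, AUPRC, and z-scored VIS. The sketch below illustrates each using only the Python standard library. It is not the study's implementation: the function names (random_oversample, average_precision, z_score) are hypothetical, average precision is used here as one common step-wise estimate of AUPRC, and no RF model is fitted. In practice the oversampled rows would be passed to an RF library of your choice, and oversampling would normally be applied to the training split only so that the evaluation data keep the original class balance.

```python
import random
import statistics


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are balanced.

    Assumption: binary labels, with `minority_label` marking the rare class
    (for example, respondents who report heroin use).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority-class rows to sample from")
    deficit = max(len(majority) - len(minority), 0)
    # Sample minority indices with replacement to make up the deficit.
    extra = [rng.choice(minority) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def average_precision(labels, scores, positive_label=1):
    """Step-wise estimate of AUPRC: mean precision at each true-positive rank.

    Ties in `scores` are broken by input order, which is a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == positive_label:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("no positive labels")
    return precision_sum / hits


def z_score(importances):
    """Standardise raw importance scores so they are comparable across models."""
    values = list(importances.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("all importance scores are identical")
    return {name: (value - mean) / sd for name, value in importances.items()}
```

A z-scored VIS of 11.05, as reported for cocaine usage, then reads as an importance roughly eleven standard deviations above the mean importance across all features.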
