Abstract
Variable selection is a common technique to identify the most predictive variables from a pool of candidate predictors. Low prevalence predictors (LPPs) are frequently found in clinical data, yet few studies have explored their impact on model performance during variable selection. This study compared the Random Forest (RF) algorithm with forward and backward stepwise regression (SWR) for variable selection, using data from a paediatric sepsis screening tool in which 18 of 32 predictors had a prevalence < 10%. The methods were compared on the area under the receiver operating characteristic curve (AUC) and on the variables retained. Additionally, a simulation study assessed how increasing the prevalence of the predictors affected the variable selection results. The best-fitting RF and SWR models retained 22 and 17 predictors, respectively, of which 14 and 10 had a prevalence < 10%. The RF and SWR models had similar predictive performance (RF: AUC [95% confidence interval] 0.79 [0.77, 0.81]; SWR: 0.80 [0.78, 0.82]). The simulation study revealed differences for both RF and SWR models in variable importance rankings and predictor selection with increasing prevalence thresholds, particularly for moderately and strongly associated predictors. The RF algorithm retained more very low prevalence predictors than SWR. However, the predictive performance of the two models was comparable, demonstrating that, when applied correctly and when the number of candidate predictors is small, both methods are suitable for variable selection with low prevalence predictors.
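As a minimal illustration of two quantities used in the abstract, the sketch below computes the prevalence of binary predictors, flags those under the 10% threshold, and computes AUC as the probability that a randomly chosen case scores higher than a randomly chosen non-case. The function names and toy data are hypothetical and are not taken from the study; this is a sketch under those assumptions, not the authors' analysis code.

```python
from typing import Dict, List, Sequence


def prevalence(column: Sequence[int]) -> float:
    """Proportion of observations in which a binary predictor equals 1."""
    return sum(column) / len(column)


def low_prevalence_predictors(
    data: Dict[str, Sequence[int]], threshold: float = 0.10
) -> List[str]:
    """Names of predictors whose prevalence is strictly below the threshold."""
    return [name for name, col in data.items() if prevalence(col) < threshold]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the Mann-Whitney statistic; tied case/non-case pairs count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): 20 patients, two binary predictors.
toy = {
    "rare_sign": [1] + [0] * 19,        # prevalence 0.05
    "common_sign": [1] * 6 + [0] * 14,  # prevalence 0.30
}
flagged = low_prevalence_predictors(toy)

# Toy outcome labels and model scores (also invented).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.6, 0.2, 0.8, 0.4]
toy_auc = auc(toy_labels, toy_scores)
```

The pairwise AUC is quadratic in sample size, which is acceptable for a sketch; a rank-based formula or a statistics library would be preferable for real data.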