Abstract

Background: Advances in causal inference have helped explain the longstanding birthweight and obesity paradoxes as selection bias due to conditioning on a collider variable, i.e. collider-stratification bias (CSB). The lessons learned have critical implications for the interpretation of machine learning (ML) methods, including decision trees and random forests (RFs), that implicitly condition on input variables. RFs are a popular approach for identifying important “predictors” from large datasets through variable importance, defined by the average decrease in prediction accuracy. While CSB has become a recognized concern when estimating exposure-outcome effects, knowledge of its impact on ML’s variable importance measures (VIMs) is limited. Applying the causal inference framework, we investigated the accuracy of RFs’ VIMs in data-generating mechanisms prone to CSB.

Methods: A Monte Carlo simulation study was conducted, with binary outcome and collider variables generated from logistic models. Two exposure variables stochastically determined the outcome and a collider variable that was causally independent of the outcome. VIMs from RFs were compared with the known causal relevance of the input variables to the outcome.

Results: While the variable importance of true exposure variables was not systematically affected by CSB, the validity of VIMs can be compromised, leading to the erroneous selection of collider variables, which are causally independent of the outcome, as outcome predictors.

Conclusions: In the presence of CSB, VIMs are not valid measures of the causal relevance of variables and may mislead the selection of truly important factors that affect the outcome.

Key messages: ML analyses must consider causal data-generating mechanisms; otherwise, they may lead to an erroneous assessment of variable importance for outcome prediction.
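To make the mechanism behind CSB concrete, the following is a minimal sketch, not the study's simulation code. It assumes two independent standard-normal exposures and a binary collider drawn from a logistic model with a hypothetical slope of 2.0; the function names, sample size and seed are illustrative choices. It compares the correlation between the exposures overall with the correlation inside one stratum of the collider, which is the kind of stratification a tree performs when it splits on that variable. It does not fit an RF or compute VIMs.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, slope=2.0, seed=1):
    """Draw two independent exposures and a binary collider caused by both.

    The slope, sample size and seed are illustrative assumptions,
    not values taken from the abstract.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        p_c = 1 / (1 + math.exp(-slope * (x1 + x2)))  # logistic model for C
        c = 1 if rng.random() < p_c else 0
        rows.append((x1, x2, c))
    return rows


rows = simulate()

# Association between the exposures in the full sample.
marginal = pearson([r[0] for r in rows], [r[1] for r in rows])

# Association within the stratum C = 1, i.e. after conditioning on the collider.
stratum = [r for r in rows if r[2] == 1]
conditional = pearson([r[0] for r in stratum], [r[1] for r in stratum])

print(f"corr(X1, X2) overall:     {marginal:+.3f}")
print(f"corr(X1, X2) given C = 1: {conditional:+.3f}")
```

Under these assumptions the exposures are close to uncorrelated in the full sample but negatively correlated within the stratum, even though neither exposure affects the other. Extending the sketch to the study's setting would require adding a binary outcome and an RF implementation, both of which are left out here.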
