Abstract

Tree-based models are a popular tool for predicting a response from a set of explanatory variables when the regression function is complex. They are sometimes also used to identify important variables and for variable selection. We show that if the generating model contains chains of direct and indirect effects, then the typical variable importance measures mainly select as important the background variables, which have a strong indirect effect, and disregard the variables that directly influence the response. This is mainly attributable to the choice of splitting variable in the first steps of the algorithm and to the greedy nature of the search. This pitfall can be relevant when tree-based algorithms are used to understand the underlying generating process, for population segmentation, and for causal inference.
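
The mechanism described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the simulation design of the paper: a single background variable `Z` drives five direct variables `X1`..`X5`, the response is the sum of the direct variables plus noise, and the helper names `best_split_gain` and `simulate` are hypothetical. The code mimics only the first greedy step of a regression tree, that is, choosing the variable whose best single threshold split most reduces the sum of squared errors.

```python
import random


def best_split_gain(x, y):
    """Fraction of the parent sum of squared errors (SSE) removed by the
    best single threshold split on x (the first greedy step of a regression tree)."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_sum = sum(v for _, v in pairs)
    total_sq = sum(v * v for _, v in pairs)
    sse_parent = total_sq - total_sum ** 2 / n
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid threshold between identical x values
        right_sum = total_sum - left_sum
        # SSE reduction of splitting into the first i points and the remaining n - i
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total_sum ** 2 / n
        best = max(best, gain)
    return best / sse_parent


def simulate(n=2000, k=5, seed=0):
    """Hypothetical chain: Z -> X1..Xk -> Y (Z has no direct effect on Y)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    # Each direct variable is the background variable plus independent noise.
    xs = [[zi + rng.gauss(0, 1) for zi in z] for _ in range(k)]
    # The response depends only on the direct variables.
    y = [sum(col[i] for col in xs) + rng.gauss(0, 1) for i in range(n)]
    return z, xs, y


z, xs, y = simulate()
gains = {"Z": best_split_gain(z, y)}
for j, col in enumerate(xs, start=1):
    gains[f"X{j}"] = best_split_gain(col, y)

# Variable a greedy tree would pick for the root split.
root_variable = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root split on:", root_variable)
```

Under these assumptions the theoretical correlation of the response with `Z` is 5/√31 ≈ 0.90, while with each direct variable it is 6/√62 ≈ 0.76, so the greedy search is expected to split on the background variable first even though it affects the response only through the direct variables.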
