Letter to the Editor: On the term 'interaction' and related phrases in the literature on Random Forests

A.-L. Boulesteix, A. Hapfelmeier, S. Janitza, K. Van Steen, C. Strobl

doi:10.1093/bib/bbu012

Abstract

In an interesting and quite exhaustive review on Random Forests (RF) methodology in bioinformatics, Touw et al. address, among other topics, the problem of detecting interactions between variables based on RF methodology. We feel that some important statistical concepts, such as ‘interaction’, ‘conditional dependence’ or ‘correlation’, are sometimes employed inconsistently in the bioinformatics literature in general and in the literature on RF in particular. In this letter to the Editor, we aim to clarify some of the central statistical concepts and point out some confusing interpretations concerning RF given by Touw et al. and other authors.

Highlights

Random Forests (RF) is a valuable analysis tool, especially in situations where datasets contain many variables with complex relationships.
We will give a consistent statistical definition of those concepts that are most central for understanding the rationale and behavior of RF.
Our intention is not to impose our definitions on everyone but rather to provide a possible interpretation of the considered concepts that allows a better understanding of some aspects of RF methodology.

Summary

INTRODUCTION

Random Forests (RF) is a valuable analysis tool, especially in situations where datasets contain many variables with complex relationships.

This behavior is not outright wrong, because there are different concepts for judging the importance of a variable in the presence of associations/correlations among the predictor variables (see, for example, [20]). It is not the behavior a user may expect when he/she is used to the partial or conditional behavior of the regression coefficients in (generalized) linear models that was outlined in the section ‘Interactions in regression models’. The marginal or unconditional view is inherent in the standard RF importance measure and in correlations between one predictor variable and the response variable without taking potential confounders into account.

This principle can again be illustrated by recalling the model formula for the logistic regression model with the two predictor variables $X_2$ (trained staff) and $X_3$ (clean hospital floors), which do not interact: $\operatorname{logit}[P(Y = 1 \mid X_2 = x_2, X_3 = x_3)]$.

Does it relate to the ability of RF to yield high individual VIMs for predictor variables involved in interactions [31], the possibility to directly identify which predictor variables interact with each other by examining a RF [32, 1], or the combination of RF with other analysis tools with the aim of identifying interactions [30]? In any case, when an algorithm based on RF (possibly combined with other tools) is suggested to identify which predictor variables interact with each other, we claim that this algorithm should be assessed in simulations using adequate measures such as, for example, sensitivity, the proportion of pairs of interacting variables that are correctly identified as interacting; specificity, the proportion of pairs of non-interacting variables that are correctly identified as non-interacting; or false positive rate, the proportion of pairs of non-interacting variables within the pairs identified as interacting.
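
The three assessment measures named above can be made concrete with a small sketch. It assumes a simulation in which the truly interacting pairs are known by design. The function name `pair_detection_metrics`, the variable labels, and the example pairs are hypothetical and not taken from the letter. Note that the 'false positive rate' is computed exactly as worded here, that is, relative to the pairs identified as interacting, a quantity that is elsewhere often called the false discovery proportion.

```python
from itertools import combinations


def pair_detection_metrics(variables, true_pairs, detected_pairs):
    """Score detected interacting pairs against the simulated truth.

    Pairs are unordered, so ("X1", "X2") and ("X2", "X1") are the same pair.
    """
    all_pairs = {frozenset(p) for p in combinations(variables, 2)}
    truth = {frozenset(p) for p in true_pairs}
    detected = {frozenset(p) for p in detected_pairs}

    non_interacting = all_pairs - truth
    true_pos = detected & truth            # interacting and identified as such
    false_pos = detected - truth           # identified, but not interacting
    true_neg = non_interacting - detected  # correctly left unidentified

    def ratio(numerator, denominator):
        # Undefined (NaN) when the reference set is empty.
        return len(numerator) / len(denominator) if denominator else float("nan")

    return {
        # interacting pairs correctly identified as interacting
        "sensitivity": ratio(true_pos, truth),
        # non-interacting pairs correctly identified as non-interacting
        "specificity": ratio(true_neg, non_interacting),
        # non-interacting pairs within the pairs identified as interacting
        "false_positive_rate": ratio(false_pos, detected),
    }


if __name__ == "__main__":
    # Hypothetical simulation: four predictors, two truly interacting pairs,
    # and an algorithm that reports one correct and one incorrect pair.
    metrics = pair_detection_metrics(
        variables=["X1", "X2", "X3", "X4"],
        true_pairs=[("X1", "X2"), ("X3", "X4")],
        detected_pairs=[("X1", "X2"), ("X1", "X3")],
    )
    print(metrics)
```

In a simulation study these values would typically be averaged over many simulated datasets, since a single run says little about how reliably an algorithm recovers the interacting pairs.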

Journal: Briefings in Bioinformatics	Publication Date: Apr 9, 2014
Citations: 74	License type: cc-by
