Techniques to Improve Ecological Interpretability of Black-Box Machine Learning Models

Thomas Welchowski, Kelly O. Maloney, Richard Mitchell, Matthias Schmid

doi:10.1007/s13253-021-00479-7

Abstract

Statistical modeling of ecological data is often faced with a large number of variables as well as possible nonlinear relationships and higher-order interaction effects. Gradient boosted trees (GBT) have been successful in addressing these issues and have shown a good predictive performance in modeling nonlinear relationships, in particular in classification settings with a categorical response variable. They also tend to be robust against outliers. However, their black-box nature makes it difficult to interpret these models. We introduce several recently developed statistical tools to the environmental research community in order to advance interpretation of these black-box models. To analyze the properties of the tools, we applied gradient boosted trees to investigate biological health of streams within the contiguous USA, as measured by a benthic macroinvertebrate biotic index. Based on these data and a simulation study, we demonstrate the advantages and limitations of partial dependence plots (PDP), individual conditional expectation (ICE) curves and accumulated local effects (ALE) in their ability to identify covariate–response relationships. Additionally, interaction effects were quantified according to interaction strength (IAS) and Friedman’s H^2 statistic. Interpretable machine learning techniques are useful tools to open the black box of gradient boosted trees in the environmental sciences. This finding is supported by our case study on the effect of impervious surface on the benthic condition, which agrees with previous results in the literature. Overall, the most important variables were ecoregion, bed stability, watershed area, riparian vegetation and catchment slope. These variables were also present in most identified interaction effects. In conclusion, graphical tools (PDP, ICE, ALE) enable visualization and easier interpretation of GBT but should be supported by analytical statistical measures. Future methodological research is needed to investigate the properties of interaction tests. Supplementary materials accompanying this paper appear on-line.

Highlights

Ecological data are often high-dimensional and exhibit complex interactions among variables.
We used data taken as part of the US Environmental Protection Agency’s (EPA) 2008–2009 US National Rivers and Streams Assessment (NRSA) Survey (EPA 2016a), which sampled streams across the conterminous USA to assess physiochemical and biological condition based on a generalized random tessellation stratified design (Stevens and Olsen 2004).
We used the benthic multimetric index (MMI) condition class, which evaluates stream condition based on the benthic macroinvertebrate community (Stribling and Dressing 2015).

Summary

Introduction

Ecological data are often high-dimensional and exhibit complex interactions among variables. Higher-order interactions may become a problem for supervised statistical modeling techniques that investigate the effects of additive combinations of a set of covariates on a response variable of interest. Model fitting tends to be robust toward outliers in the covariates (Breiman et al. 1984), and the overall “strengths” of the covariate effects can be assessed using variable importance measures (Fisher et al. 2016). By definition, these measures estimate how much a covariate influences the prediction error compared to a situation where the covariate has no effect under a given model.
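The idea behind such variable importance measures can be illustrated with a minimal permutation-based sketch. It is an illustration under stated assumptions, not the procedure used in the paper: shuffling a covariate column is taken as one common way to mimic "the covariate has no effect", mean squared error is assumed as the prediction error, and the names `permutation_importance` and `toy_model` as well as the toy data are hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling one covariate column.

    Shuffling breaks the link between the covariate and the response,
    which approximates a situation where the covariate has no effect.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        permuted = mean_squared_error(y, [predict(row) for row in X_perm])
        increases.append(permuted - baseline)
    return sum(increases) / n_repeats


# Hypothetical toy data: the response depends on covariate 0 only;
# covariate 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def toy_model(row):
    """Stand-in for a fitted model; it uses covariate 0 and ignores covariate 1."""
    return 3.0 * row[0]


importance_x0 = permutation_importance(toy_model, X, y, column=0)
importance_x1 = permutation_importance(toy_model, X, y, column=1)
print(f"covariate 0: {importance_x0:.3f}, covariate 1: {importance_x1:.3f}")
```

In this toy setting, shuffling covariate 0 raises the prediction error, whereas shuffling covariate 1 leaves it unchanged because the model never uses that covariate. With a fitted GBT, `toy_model` would be replaced by the model's prediction function.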

Objectives

Results

Conclusion