In pursuit of a more robust provenance in the field of species distribution modelling, an extensive literature search was undertaken to find the typical default values, and the range of values, for configuration settings of a large number of the most commonly used statistical algorithms available for constructing species distribution models (SDM), as implemented in the R script packages (such as Dismo and Biomod2) or other species distribution modelling programs like MaxEnt. We found that documentation of SDM algorithm configuration option settings in the SDM literature is, overall, very uncommon, and the justifications for these settings were minimal, when present. Such settings were often the R default values, or were the result of trial and error. This is potentially concerning since: (i) it detracts from the robustness of the provenance for such SDM studies; (ii) a lack of documentation of configuration option settings in a paper prevents the replication of an experiment, which contravenes one of the main tenets of the scientific method; (iii) inappropriate or uninformed configuration option settings are particularly concerning if they represent a poorly understood ecological variable or process, and if the algorithm is sensitive to such settings, this could result in erroneous and/or unrealistic SDMs. Therefore, this study sets out to comprehensively test the sensitivity of eight widely used SDM algorithms to variation in configuration options settings: MaxEnt, Artificial Neural Network (ANN), Generalized Linear Model (GLM), Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines (MARS), Flexible Discriminant Analysis (FDA), Surface Range Envelope (SRE) and Classification tree analysis (CTA). A process of expert elicitation was used to derive a range of appropriate values with which to test the sensitivity of our algorithms. We chose to use species occurrence records for two species - Koala (Phascolartos cinereus) and Thorny Devil (Moloch horridus) - in order to investigate how algorithm sensitivity depends on the species being modelled. Results were assessed by comparing the modelled distribution of the control SDM (default settings) to the modelled distribution from each sensitivity test SDM (i.e. non-default configuration settings). This was done using the visual and statistical measures of predictive performance available in the Biodiversity and Climate Change Virtual Laboratory (BCCVL), including the area under the (receiver operating characteristic) curve. The aim of our study was to be able to draw conclusions as to how the sensitivity of SDM algorithms to their configuration option settings may detract from the reliability of SDM results, given the often unjustified and unscrutinized use of the default settings, and generally infrequent and largely perfunctory attendance to this issue in most of the published SDM literature. Our results indicate that all of the algorithms tested showed sensitivity to alternative (non-default) values for some of their configuration settings and that often this sensitivity is species-dependent. Therefore we can conclude that the choice of configuration settings in these widely used SDM algorithms can have a large impact on the resulting projected distribution. This has important ramifications for decision-making and policy outcomes wherever SDMs are used to inform species and biodiversity management plans and policy settings. This study demonstrates that assigning suitable values for these settings is a very important consideration and as such should always be published along with the model. Documenting all configuration settings is necessary to increase the scientific robustness, transparency and reproducibility of species distribution modelling studies.
Read full abstract