Is more data always better? A simulation study of benefits and limitations of integrated distribution models

Emily G Simmonds, Robert B O'Hara, Peter A Henrys, Susan G Jarvis, Nick J B Isaac

doi:10.1111/ecog.05146

Open Access

https://doi.org/10.1111/ecog.05146

Abstract

Species distribution models are popular and widely applied ecological tools. Recent increases in data availability have led to opportunities and challenges for species distribution modelling. Each data source has different qualities, determined by how it was collected. Although several data sources can inform on a single species, ecologists have often analysed just one of them, which loses information because the other data sources are discarded. Integrated distribution models (IDMs) were developed to enable inclusion of multiple datasets in a single model, whilst accounting for different data collection protocols. This is advantageous because it allows efficient use of all data available, can improve estimation and can account for biases in data collection. What is not yet known is when integrating different data sources does not bring advantages. Here, for the first time, we explore the potential limits of IDMs using a simulation study integrating a spatially biased, opportunistic, presence‐only dataset with a structured, presence–absence dataset. We explore four scenarios based on real ecological problems: small sample sizes, low levels of detection probability, correlations between covariates and a lack of knowledge of the drivers of bias in data collection. For each scenario we ask: do we see improvements in parameter estimation or the accuracy of spatial pattern prediction in the IDM versus modelling either data source alone? We found integration alone was unable to correct for spatial bias in presence‐only data. Including a covariate to explain bias or adding a flexible spatial term improved IDM performance beyond single dataset models, with the models including a flexible spatial term producing the most accurate and robust estimates. Increasing the sample size of presence–absence data and having no correlated covariates also improved estimation. These results demonstrate under which conditions integrated models provide benefits over modelling single data sources.

Highlights

Species distribution modelling has many applications in ecology and is a mature discipline
Both the integrated distribution model (IDM) with a bias covariate and the IDM with a second spatial field performed better with more presence–absence (PA) data; they outperformed presence-only (PO) models that included the bias covariate, although the improvement was small at low levels of PA data
Our simulation study investigated whether IDMs always perform better than models of PO or PA data alone under a range of scenarios

Summary

Introduction

Species distribution modelling has many applications in ecology and is a mature discipline. Data mobilization, citizen science and a raft of new monitoring technologies have generated enormous growth in the data available for such models. Whilst these new data streams are welcome, they present challenges for species distribution modelling because each data source has different attributes, reflecting variation in protocols, spatial extent, sampling intensity and the time period over which they were collected. Confronted by this heterogeneity, modellers commonly face a choice over which data sources to use for a particular application. Integrated distribution models (IDMs) avoid this choice by including multiple datasets in a single model. Integration is usually achieved by sharing parameters between datasets, often by treating each data source as a separate realisation of the true distribution (the ‘joint-likelihood approach’; Pacifici et al. 2017).
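The joint-likelihood idea can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not necessarily the specification used in this study: the presence–absence (PA) data are assumed to follow a complementary log-log model, the presence-only (PO) data are assumed to be Poisson counts per grid cell with a log-linear bias term, and all function and parameter names (`pa_log_lik`, `po_log_lik`, `joint_log_lik`, `alpha`, `beta`, `gamma`) are hypothetical. The point is only that the intercept `alpha` and the environmental effect `beta` appear in both likelihoods, while the bias effect `gamma` enters the PO part alone.

```python
import math


def pa_log_lik(alpha, beta, x, y):
    """Log-likelihood of presence-absence (PA) data.

    Assumption: a site is occupied when a Poisson process with intensity
    exp(alpha + beta * x) places at least one individual in it, which gives
    a complementary log-log link: p = 1 - exp(-intensity).
    """
    total = 0.0
    for xi, yi in zip(x, y):
        intensity = math.exp(alpha + beta * xi)
        if yi == 1:
            # log(1 - exp(-intensity)), computed stably with expm1
            total += math.log(-math.expm1(-intensity))
        else:
            # log(exp(-intensity))
            total += -intensity
    return total


def po_log_lik(alpha, beta, gamma, x, bias, counts):
    """Log-likelihood of presence-only (PO) counts per grid cell.

    Assumption: counts are Poisson with mean
    exp(alpha + beta * x + gamma * bias), where `bias` is a covariate
    describing uneven sampling effort.
    """
    total = 0.0
    for xi, bi, ni in zip(x, bias, counts):
        log_mu = alpha + beta * xi + gamma * bi
        total += ni * log_mu - math.exp(log_mu) - math.lgamma(ni + 1)
    return total


def joint_log_lik(params, pa_data, po_data):
    """Sum of both log-likelihoods; alpha and beta are shared parameters."""
    alpha, beta, gamma = params
    return (pa_log_lik(alpha, beta, *pa_data)
            + po_log_lik(alpha, beta, gamma, *po_data))


# Toy data, invented purely for illustration.
pa_data = ([-1.0, 0.0, 1.0, 2.0], [0, 0, 1, 1])           # covariate, presence/absence
po_data = ([-1.0, 0.0, 1.0, 2.0], [0.0, 1.0, 0.0, 1.0],   # covariate, bias covariate
           [0, 2, 1, 9])                                   # PO counts per cell
print(joint_log_lik((0.0, 1.0, 0.5), pa_data, po_data))
```

Maximising `joint_log_lik` over `(alpha, beta, gamma)` would let both datasets inform the shared parameters, which is the sense in which each data source is treated as a separate realisation of the same underlying distribution.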

Methods

Results

Conclusion