Abstract

Presence/absence data and presence‐only data are the two customary sources for learning about species distributions over a region. We present an ambitious agenda with regard to the analysis of such data. We illuminate the fundamental modeling differences between the two types of data. Most simply, locations are considered to be fixed under presence/absence data; locations are random under presence‐only data. The definition of “probability of presence” is incompatible between the two. We are not comfortable with modeling strategies in the literature that ignore this incompatibility and that assume that presence/absence modeling can be induced from presence‐only specifications and, therefore, that fusion of presence‐only and presence/absence data sources is routine. While, in some cases, data collection may not support this, we propose that, since, in nature, presence/absence is seen at the point locations, presence/absence data should be modeled at point level. If so, we need to specify two surfaces. The first provides the probability of presence at any location in the region. The second provides a realization from this surface in the form of a binary map yielding the results of Bernoulli trials across all locations; this surface is only partially observed. On the other hand, presence‐only data should be modeled as a (partially observed) point pattern, arising from a random number of individuals seen at random locations, driven by specification of an intensity function. There is no notion of Bernoulli trials; events are associated with areas. We further suggest that, with just presence/absence data, preferential sampling of locations may arise. Accounting for this, using a shared process perspective, can improve our estimated presence/absence surface as well as prediction of presence. We further propose that preferential sampling can enable a probabilistically coherent fusion of the two data types.
We illustrate with two real data sets, one presence/absence, one presence‐only, for invasive species presence in New England in the United States. We demonstrate that potential bias in sampling locations can affect inference with regard to presence/absence and show that inference can be improved with preferential sampling ideas. We also provide a probabilistically coherent fusion of the two data sets again with the goal of improving inference for presence/absence. The importance of our work is to encourage more careful modeling when studying species distributions. Ignoring incompatibility between data types and adopting nongenerative modeling specifications results in invalid inference; the quantitative ecological community should benefit from this recognition.
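The distinction drawn in the abstract between fixed locations (presence/absence) and random locations (presence‐only) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' model: the covariate, the coefficient values, the unit-square region, and all function names are hypothetical choices made purely for illustration. Presence/absence is generated as one Bernoulli trial per fixed site; presence‐only data are generated as an inhomogeneous Poisson point pattern by thinning.

```python
import numpy as np

rng = np.random.default_rng(0)

def covariate(s):
    """Hypothetical environmental covariate x(s) on the unit square (at most 2)."""
    return np.sin(2 * np.pi * s[:, 0]) + s[:, 1]

def simulate_presence_absence(sites, beta0=-0.5, beta1=1.0):
    """Fixed sites: one Bernoulli trial per site, logit p(s) = beta0 + beta1 * x(s)."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * covariate(sites))))
    y = rng.binomial(1, p)
    return p, y

def simulate_presence_only(gamma0=4.0, gamma1=1.0):
    """Random locations: Poisson process with intensity exp(gamma0 + gamma1 * x(s)),
    simulated by thinning a homogeneous process on the unit square."""
    lam_max = np.exp(gamma0 + gamma1 * 2.0)   # upper bound, since x(s) <= 2
    n = rng.poisson(lam_max)                  # region has area 1
    cand = rng.uniform(size=(n, 2))
    lam = np.exp(gamma0 + gamma1 * covariate(cand))
    keep = rng.uniform(size=n) < lam / lam_max
    return cand[keep]

# Presence/absence: the sites are chosen in advance (here a 10 x 10 grid).
g = (np.arange(10) + 0.5) / 10
sites = np.array([(a, b) for a in g for b in g])
p, y = simulate_presence_absence(sites)

# Presence-only: both the number of points and their locations are random.
points = simulate_presence_only()

print(len(sites), "fixed sites,", int(y.sum()), "observed presences")
print(points.shape[0], "presence-only points at random locations")
```

In the first mechanism the response is a 0/1 value attached to each known site; in the second the data are the locations themselves, and there is no Bernoulli trial anywhere in the construction.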

Highlights

  • Learning about species distributions is, arguably, a preoccupation in the ecology community

  • The fact that presence/absence is not observable at point level does not preclude useful point level modeling. This is the case with all geostatistical modeling (Banerjee et al., 2014), e.g., temperature is never observed at a unitless location but we routinely model temperature surfaces

  • The issue is whether presence/absence is viewed at point level or at areal level. Is it a Bernoulli trial at a location or is it the probability that the number of individuals of a species in set A is ≥ 1? If we model presence/absence at point level, it is clear what Y(s) = 1 means but what does Y(A) mean? A coherent probabilistic definition arises as a block average, i.e., a realization of Y(A) is $\int_A 1(Y(s) = 1)\,ds / |A|$, the proportion of the Y(s) in A that equal 1; it is not a Bernoulli trial and P(Y(A) = 1) = 0
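The block-average definition in the last highlight can be approximated numerically. The sketch below assumes a fine grid of cells standing in for the continuum of locations, a made-up probability surface, and independent Bernoulli draws per cell; none of these choices come from the paper, and a realistic model would typically introduce spatial dependence. It shows that the block-level quantity is a proportion in [0, 1] rather than the outcome of a single Bernoulli trial.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fine grid over the unit square as a stand-in for all locations s.
m = 200
xs = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(xs, xs, indexing="ij")

# Hypothetical probability-of-presence surface p(s), increasing in the first coordinate.
p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * gx)))

# Binary map Y(s): one Bernoulli trial per grid cell (independence is an assumption).
y = rng.binomial(1, p)

def block_average(binary_map, in_block):
    """Grid approximation of the integral of 1(Y(s) = 1) over A, divided by |A|."""
    return binary_map[in_block].mean()

# Block A: the lower-left quarter of the region.
in_A = (gx < 0.5) & (gy < 0.5)
y_A = block_average(y, in_A)

print("Y(A) as a block average:", round(float(y_A), 3))
print("average of p(s) over A: ", round(float(p[in_A].mean()), 3))
```

Because Y(A) is a proportion, it takes a value strictly between 0 and 1 here; asking for P(Y(A) = 1) amounts to asking that every location in A be a presence, which is the sense in which that probability is 0 for the continuum.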

Summary

Introduction

Learning about species distributions is, arguably, a preoccupation in the ecology community. The literature discusses two types of data collection to learn about species distributions: presence/absence and presence-only. The former imagines some version of designed sampling where, say, plots (grid cells, quadrats, etc.) are sampled and presence/absence or abundance of a species is observed for the sampled plots. Under point-level modeling for such data, we bring in preferential sampling ideas to clarify how potential bias in sampling locations can affect inference with regard to presence/absence. We propose a probabilistically coherent fusion, again employing the shared process perspective for implementing the fusion, extending application of preferential sampling. This allows the two data sources to be probabilistically independent or dependent. This perspective enables a collection of models to take presence/absence modeling to a much richer explanatory level. In order to go forward, we first need some preliminary words regarding what a presence/absence event means.
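To make the concern about bias in sampling locations concrete, the following sketch simulates a case in which a single latent surface drives both the probability of presence and the chance that a location is visited. Everything here is an illustrative assumption rather than the authors' specification: the latent surface `w`, the preference strength `delta`, the grid, and the sample size are invented, and the code only exhibits the bias; it does not fit a shared process model to correct it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid of candidate sampling locations on the unit square.
m = 100
xs = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(xs, xs, indexing="ij")

# Hypothetical shared latent surface w(s).
w = 1.5 * np.sin(2 * np.pi * gx) * np.cos(2 * np.pi * gy)

# Probability of presence and the realized binary map.
p = (1.0 / (1.0 + np.exp(-w))).ravel()
y = rng.binomial(1, p)

# Preferential design: locations with larger w(s) are more likely to be visited.
delta = 1.5                      # assumed strength of the preference
weights = np.exp(delta * w).ravel()
weights /= weights.sum()

n = 1000
pref_idx = rng.choice(p.size, size=n, replace=False, p=weights)
unif_idx = rng.choice(p.size, size=n, replace=False)

true_prev = p.mean()             # region-wide average probability of presence
naive_pref = y[pref_idx].mean()  # naive estimate under preferential sampling
naive_unif = y[unif_idx].mean()  # naive estimate under uniform sampling

print("region-wide prevalence:       ", round(float(true_prev), 3))
print("naive, preferential locations:", round(float(naive_pref), 3))
print("naive, uniform locations:     ", round(float(naive_unif), 3))
```

Under these assumptions the preferentially sampled sites over-represent locations where presence is likely, so the naive sample proportion overstates region-wide prevalence, while the uniform design does not show this systematic shift.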

The fundamental issue
Our motivating dataset
What does “probability of presence” mean?
What is preferential sampling all about?
Spatial modeling of presence-only data in practice
Some explicit modeling details
Model fitting and inference for data fusion
Summary and future work
Further exploratory analysis of the data
Model fitting details
Model fitting for partially observed presence-only data