Abstract

Various methods exist to model a species’ niche and geographic distribution using environmental data for the study region and occurrence localities documenting the species’ presence (typically from museums and herbaria). In presence-only modelling, geographic sampling bias and small sample sizes represent challenges for many species. Overfitting to the bias and/or noise characteristic of such datasets can seriously compromise model generality and transferability, which are critical to many current applications – including studies of invasive species, the effects of climatic change, and niche evolution. Even when transferability is not necessary, applications to many areas, including conservation biology, macroecology, and zoonotic diseases, require models that are not overfit. We evaluated these issues using a maximum entropy approach (Maxent) for the shrew Cryptotis meridensis, which is endemic to the Cordillera de Mérida in Venezuela. To simulate strong sampling bias, we divided localities into two datasets: those from a portion of the species’ range that has seen high sampling effort (for model calibration) and those from other areas of the species’ range, where less sampling has occurred (for model evaluation). Before modelling, we assessed the climatic values of localities in the two datasets to determine whether any environmental bias accompanies the geographic bias. Then, to identify optimal levels of model complexity (and minimize overfitting), we made models and tuned model settings, comparing performance with that achieved using default settings. We randomly selected localities for model calibration (sets of 5, 10, 15, and 20 localities) and varied the level of model complexity considered (linear versus both linear and quadratic features) and two aspects of the strength of protection against overfitting (regularization). 
Environmental bias indeed corresponded to the geographic bias between datasets, with differences in median and observed range (minima and/or maxima) for some variables. Model performance varied greatly according to the level of regularization. Intermediate regularization consistently led to the best models, with decreased performance at low and generally at high regularization. Optimal levels of regularization differed between sample-size-dependent and sample-size-independent approaches, but both reached similar levels of maximal performance. In several cases, the optimal regularization value was different from (usually higher than) the default one. Models calibrated with both linear and quadratic features outperformed those made with just linear features. Results were remarkably consistent across the examined sample sizes. Models made with few and biased localities achieved high predictive ability when appropriate regularization was employed and optimal model complexity was identified. Species-specific tuning of model settings can have great benefits over the use of default settings.
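
The tuning design described above (random calibration subsets of 5, 10, 15, and 20 localities, crossed with two feature sets and several regularization levels) can be laid out as a grid of model runs. The following is a minimal sketch under stated assumptions, not the authors' workflow: the regularization values in `REG_LEVELS`, the number of replicates, and the helper names `build_runs` and `best_setting` are hypothetical, and the evaluation score for each run is assumed to come from whatever modelling software is used (no Maxent call is made here).

```python
import itertools
import random

SAMPLE_SIZES = [5, 10, 15, 20]                          # calibration set sizes named in the abstract
FEATURE_SETS = [("linear",), ("linear", "quadratic")]   # feature classes named in the abstract
REG_LEVELS = [0.5, 1.0, 2.0, 4.0]                       # placeholder values, NOT taken from the study


def build_runs(calibration_localities, n_replicates=3, seed=0):
    """Enumerate model runs: each random subset is reused across all settings."""
    rng = random.Random(seed)
    runs = []
    for size in SAMPLE_SIZES:
        for rep in range(n_replicates):
            # Draw one random calibration subset per (size, replicate) pair.
            subset = rng.sample(calibration_localities, size)
            for features, reg in itertools.product(FEATURE_SETS, REG_LEVELS):
                runs.append({
                    "n": size,
                    "replicate": rep,
                    "features": features,
                    "regularization": reg,
                    "localities": subset,
                })
    return runs


def best_setting(scored_runs):
    """Return the (features, regularization) pair with the highest mean score.

    Each run is expected to carry a "score" key, e.g. performance on the
    independent evaluation localities, filled in by the caller.
    """
    scores = {}
    for run in scored_runs:
        key = (run["features"], run["regularization"])
        scores.setdefault(key, []).append(run["score"])
    return max(scores, key=lambda k: sum(scores[k]) / len(scores[k]))
```

In use, each run's `"score"` would be filled in after fitting and evaluating a model on the held-out localities; `best_setting` then picks the combination to compare against the default settings.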
