Sample size for the evaluation of presence-absence models

Alberto Jiménez-Valverde

doi:10.1016/j.ecolind.2020.106289

Open Access

https://doi.org/10.1016/j.ecolind.2020.106289

Journal: Ecological Indicators
Publication Date: Mar 14, 2020
Citations: 30
License type: cc-by-nc-nd

Affiliation: University of Alcalá

Abstract

The sample size of the training dataset has been shown to have profound effects on the performance of species distribution models (SDMs). However, the effects that the sample size of the testing dataset can have on the assessment of a model's predictive capacity have received little attention. In this study, I used simulations to study how accurately two discrimination statistics are estimated as a function of sample size: the AUC (the area under the receiver operating characteristic, or ROC, curve) and Se* (the probability of correctly classifying any case, calculated at the threshold that minimizes the difference between sensitivity and specificity). ROC curves with known discrimination ability were simulated, samples were randomly taken, the two discrimination statistics were estimated, and the differences between the two estimators and their respective true values were computed to understand how bias and precision were affected by sample size. In general, as sample size increased, the difference between reported and true discrimination capacity decreased. There were no important differences between the estimated AUC and Se* statistics in terms of bias and precision. Under realistic scenarios where the ROC points are not necessarily part of the true underlying ROC curve, the two discrimination statistics are both unbiased and equally precise, and the higher the true discrimination capacity is, the more accurately they are estimated. A sample size of between 20 and 30 is a lower limit, since below this interval the accuracy of the estimates decreases considerably. Altogether, these results are very important since many interesting SDM applications involve rare and poorly known species for which sample sizes are unavoidably small.
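The simulation logic described in the abstract can be illustrated with a minimal sketch. It is not the author's procedure, whose details are not given here. It assumes a binormal model (absence scores drawn from N(0, 1), presence scores from N(d, 1)), under which the true AUC is Φ(d/√2) and the true Se* is Φ(d/2). The function names, the equal split between presences and absences, and the choice to report the estimated Se* as the mean of sensitivity and specificity at the selected threshold are all assumptions made for illustration.

```python
import math
import numpy as np


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def auc_hat(pres, abs_):
    """Empirical AUC: share of presence-absence pairs ranked correctly (ties count 0.5)."""
    diff = pres[:, None] - abs_[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def se_star_hat(pres, abs_):
    """Estimate Se* at the observed threshold minimizing |sensitivity - specificity|.

    Assumption: the estimate is the mean of sensitivity and specificity there.
    """
    thresholds = np.unique(np.concatenate([pres, abs_]))
    se = np.array([(pres >= t).mean() for t in thresholds])
    sp = np.array([(abs_ < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(se - sp)))
    return float(0.5 * (se[i] + sp[i]))


def simulate(d, n, reps=2000, seed=0):
    """Bias (mean error) and precision (SD of error) of both estimators.

    d is the separation between class means (hypothetical parameter);
    n is the total test sample size, split evenly between classes.
    """
    rng = np.random.default_rng(seed)
    n_pres = n // 2
    n_abs = n - n_pres
    true_auc = phi(d / math.sqrt(2.0))
    true_se = phi(d / 2.0)
    auc_err = np.empty(reps)
    se_err = np.empty(reps)
    for r in range(reps):
        pres = rng.normal(d, 1.0, n_pres)
        abs_ = rng.normal(0.0, 1.0, n_abs)
        auc_err[r] = auc_hat(pres, abs_) - true_auc
        se_err[r] = se_star_hat(pres, abs_) - true_se
    return {
        "auc_bias": float(auc_err.mean()),
        "auc_sd": float(auc_err.std(ddof=1)),
        "se_bias": float(se_err.mean()),
        "se_sd": float(se_err.std(ddof=1)),
    }


if __name__ == "__main__":
    for n in (10, 20, 30, 100, 500):
        print(n, simulate(d=1.5, n=n, reps=500))
```

Running the loop over several values of n shows how the spread of the estimation error changes with testing sample size under these assumptions. The specific values depend on the chosen d and on the assumed score distributions.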

Full Text