Large biodiversity datasets conform to Benford's law: Implications for assessing sampling heterogeneity

Judit K Szabo, Lucas Rodriguez Forti, Corey T Callaghan

doi:10.1016/j.biocon.2023.109982

Abstract

Inadequate sampling can cause biased estimates of species diversity, as species occurrence generally follows a log-normal distribution with a long tail. Understanding this sampling bias is fundamental to informing biodiversity conservation actions. However, currently available tests to assess data quality, such as fitting species abundance distribution (SAD) models and rarefaction curves, are computationally costly and can still lead to erroneous conclusions. We evaluated Benford's law (first-digit distribution) as a complementary method to assess data heterogeneity and survey coverage in large biodiversity datasets, including eBird data for 157 countries and three non-avian GBIF datasets. We also tested conformity to Benford's law in four simulated communities with different SAD models and in four corrupted datasets with a log-normal SAD. Finally, we evaluated how including rare species affected conformity to Benford's law in three datasets, and compared Benford fit with the results of traditional methods for estimating survey completeness in seven datasets. Species-rich datasets with a large number of observations tended to obtain a good fit. Benford conformity can be a simple and sensitive measure of sampling evenness, complementing traditional methods for assessing data quality in large-scale studies. Benford's test can reflect species abundance heterogeneity, especially in log-normally distributed data, but was not ideal for evaluating survey completeness, as its results diverged from those of traditional methods. As the contribution of citizen science to biodiversity monitoring continues to increase, this fast and efficient method can play a critical role in assessing the quality of datasets.
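
As an illustration of the first-digit test described above, the following minimal Python sketch compares the leading digits of per-species observation counts with the proportions expected under Benford's law, P(d) = log10(1 + 1/d) for d = 1, ..., 9. The function names, the example counts, and the use of mean absolute deviation as the conformity measure are assumptions made for illustration only; the abstract does not state which conformity statistic the study used.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit proportions under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_proportions(counts):
    """Observed leading-digit proportions among positive integer counts."""
    # Assumes counts are integers; zero or negative values are ignored.
    digits = [int(str(c)[0]) for c in counts if c > 0]
    if not digits:
        raise ValueError("need at least one positive count")
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(counts):
    """Average absolute gap between observed and expected proportions.

    Assumption: this is one common way to summarize Benford conformity,
    not necessarily the statistic used in the study. Lower means closer.
    """
    expected = benford_expected()
    observed = first_digit_proportions(counts)
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Hypothetical per-species observation counts (illustrative, not study data).
species_counts = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]

if __name__ == "__main__":
    print(f"MAD from Benford: {mean_absolute_deviation(species_counts):.4f}")
```

A smaller deviation indicates a first-digit distribution closer to Benford's expectation. In practice the input would be the number of records per species in a dataset, and a threshold for "good fit" would need to be chosen and justified separately.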

Full Text