Abstract

The purpose of this study is to increase the quantity of species occurrence data by integrating opportunistic data with Global Biodiversity Information Facility (GBIF) benchmark data via a novel optimization technique. The optimization method uses Natural Language Processing (NLP) and a simulated annealing (SA) algorithm to maximize the average likelihood of species occurrence in maximum entropy presence-only species distribution models (SDM). We applied the Kruskal–Wallis test to assess differences in the corresponding environmental variables and habitat suitability indices (HSI) among three datasets: GBIF data, Facebook (FB) data, and optimally selected FB data. To quantify uncertainty in SDM predictions and the efficacy of the proposed optimization procedure, we used a bootstrapping approach to generate 1000 subsets from each of five datasets: (1) GBIF; (2) FB; (3) GBIF plus FB; (4) GBIF plus optimally selected FB; and (5) GBIF plus randomly selected FB. We compared the performance of the species distributions simulated from each of these subsets using the area under the curve (AUC) of the receiver operating characteristic (ROC). We also performed a correlation analysis between the average benchmark-based SDM outputs and the average dataset-based SDM outputs. Median AUCs of SDMs based on the dataset that combined benchmark GBIF data and optimally selected FB data were generally higher than the AUCs of the other datasets, indicating the effectiveness of the optimization procedure. Our results suggest that the proposed approach increases both the quality and quantity of data by effectively extracting opportunistic data from large unstructured datasets with respect to benchmark data.
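As a rough illustration of the selection step described above, the following minimal Python sketch uses simulated annealing to choose a fixed-size subset of candidate records that maximizes an objective. It is not the authors' implementation. In the study the objective is the average likelihood of species occurrence from a maximum entropy SDM; here that is replaced by a hypothetical stand-in (the mean of made-up per-record scores), and the names `anneal_subset`, `mean_suitability`, and `toy_suitability`, the swap move, and the geometric cooling schedule are all assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, Set


def anneal_subset(
    n_candidates: int,
    subset_size: int,
    objective: Callable[[Set[int]], float],
    n_steps: int = 2000,
    t_start: float = 1.0,
    t_end: float = 1e-3,
    seed: int = 0,
) -> Set[int]:
    """Return the best subset of candidate indices visited during annealing."""
    if not 0 < subset_size < n_candidates:
        raise ValueError("subset_size must be between 1 and n_candidates - 1")

    rng = random.Random(seed)
    current = set(rng.sample(range(n_candidates), subset_size))
    current_score = objective(current)
    best, best_score = set(current), current_score

    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        frac = step / max(n_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac

        # Propose a neighbor: swap one selected record for one unselected record.
        out_idx = rng.choice(sorted(current))
        in_idx = rng.choice([i for i in range(n_candidates) if i not in current])
        candidate = (current - {out_idx}) | {in_idx}
        candidate_score = objective(candidate)

        # Accept improvements always; accept worse moves with Boltzmann probability.
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = set(current), current_score

    return best


# Hypothetical per-record scores standing in for an SDM-based likelihood.
toy_suitability = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.0]


def mean_suitability(subset: Set[int]) -> float:
    """Toy objective: average score of the selected records."""
    return sum(toy_suitability[i] for i in subset) / len(subset)


best = anneal_subset(len(toy_suitability), 3, mean_suitability, seed=1)
print(sorted(best), mean_suitability(best))
```

In a real workflow the objective would presumably refit or re-evaluate the SDM on the benchmark data plus the selected records, which is far more expensive than this toy function, so the number of annealing steps would need to be chosen with that cost in mind.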

Highlights

  • Improving both the quality and quantity of species occurrence data is crucial for biological monitoring and species distribution modeling (SDM) in the investigation of biodiversity [1,2,3,4]

  • We focus on opportunistic data collected in Taiwan from the Facebook (FB) page of the EnjoyMoths project [36]

  • Opportunistic data can provide ecologists with additional samples to compensate for data gaps that may exist in the relatively small number of professionally collected, high-quality structured samples available from other sources



Introduction

Improving both the quality and quantity of species occurrence data is crucial for biological monitoring and species distribution modeling (SDM) in the investigation of biodiversity [1,2,3,4]. Although professionally collected data are the preferred data source for SDM, they are expensive to collect and are often in short supply. Data collected using proper crowdsourcing techniques, often termed “opportunistic data” [3,4,5,6,7,8,9,10,11,12] or unstructured volunteer data, can provide ecologists with a variety of biodiversity monitoring data, and volunteer-based citizen science monitoring systems have attracted considerable attention. Even professionally curated databases, which include portals for citizen scientists and increase the amount of structured data available for research, lack adequate coverage of species occurrence.
