Cubification of Biodiversity Data: FAIRiCUBE and the European Habitat Classification System

Abstract

European habitats are classified under a framework developed by the European Topic Centre for Biodiversity for the European Environment Agency, as part of the European Nature Information System (EUNIS) (Davies et al. 2004). All terrestrial, freshwater, and marine habitats follow a hierarchical classification based on physical features, human influence, and dominant vegetation (Moss 2008, Chytrý et al. 2020). Distribution maps are provided and modelled using occurrence data of indicator species collected from vegetation surveys (Hennekens 2017). Although the system may seem accurate, when we first plotted the distribution of the main species of our habitat study case, EUNIS Habitat S22 ‘Alpine and subalpine ericoid heath’ (European Environment Agency 2019), we observed that occurrence data, e.g., from sources like the Global Biodiversity Information Facility (GBIF), often fell outside the mapped areas of the habitat. Furthermore, important occurrence data sources, such as herbaria, were left out of the official distribution mapping, representing, in our view, a significant shortcoming of the EUNIS system. This study addresses these gaps by integrating diverse sources of in situ occurrence data (herbaria, vegetation surveys, citizen science) through a machine learning approach to complement the current EUNIS mapping. Specifically, we modelled the distributions of diagnostic species of the Habitat S22, using species distribution models (SDMs). For this purpose, we retrieved occurrence data from GBIF, identified by the accepted names as well as taxonomic synonyms, using the R package rgbif (Chamberlain et al. 2025), and utilised the Darwin Core (Wieczorek et al. 2012) standard. Data were filtered to include European occurrences with spatial coordinates and uncertainty of <500 m, and only spring and summer months of 1980–2024. For modelling itself, they were stratified into a 1-km grid. As SDM predictors, we used proxies for macroclimate and topography. 
Climatic predictors included CHELSA Bioclim variables of mean annual temperature, temperature seasonality, annual precipitation, precipitation seasonality, and an aridity index (Zomer et al. 2022). For topography, we used the Copernicus digital terrain model and calculated slope and indices for heat load (McCune and Keon 2002), topographical ruggedness (Riley et al. 1999), and topographical wetness (Beven and Kirkby 1979), using the spatialEco R package (Evans and Murphy 2021) and SAGA GIS (Conrad et al. 2015). Data were integrated into data cubes, and correlations among species occurrences and predictors were tested. We supplemented the occurrence data with pseudo-absences sampled within a buffer around presence points (Fallgatter et al. 2025). We fitted ensemble SDMs weighted by true skill statistic (TSS) scores based on independent cross-validation. We modelled two spatial resolutions in two regions: continental Europe at 1-km resolution, and the European Alps at 100-m resolution. Predicted species distributions were aggregated into cumulative distribution maps. Those were further validated by overlapping them with the distribution of the habitat based on vegetation plots classified by an expert system, as provided by the European Vegetation Archive (EVA), at 1-km resolution. Predictions were also compared with the official EUNIS probability map for Habitat S22. Correlation analyses confirmed the ecological features of the Habitat S22 indicated by the EUNIS classification. Our modelled ranges largely overlapped with the distribution of EVA plots and the EUNIS probability map, but also revealed mismatches at lower elevations and in the Scandinavian region. These differences decreased when fewer species were combined in cumulative predictions. 
Our findings show that SDMs based on occurrence data from different sources can validate and refine expert-defined habitat maps, offering a complementary and data-driven approach.
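
The occurrence filtering and gridding steps described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' pipeline (which used R and rgbif). Several details are assumptions made for the example: "spring and summer" is taken as March to August, the hypothetical fields `x_m` and `y_m` hold coordinates already projected to metres, and "stratified into a 1-km grid" is read as keeping one record per species per 1-km cell. Only `species`, `year`, `month`, and `coordinateUncertaintyInMeters` are Darwin Core term names.

```python
from math import floor

# Assumption: "spring and summer" interpreted as March-August.
SPRING_SUMMER = {3, 4, 5, 6, 7, 8}


def keep_record(rec, max_uncertainty_m=500, years=(1980, 2024)):
    """Return True if a record passes the coordinate, uncertainty and date filters."""
    if rec.get("x_m") is None or rec.get("y_m") is None:
        return False  # no spatial coordinates
    unc = rec.get("coordinateUncertaintyInMeters")
    if unc is None or unc >= max_uncertainty_m:
        return False  # uncertainty unknown or not below the threshold
    year, month = rec.get("year"), rec.get("month")
    if year is None or month is None:
        return False
    return years[0] <= year <= years[1] and month in SPRING_SUMMER


def grid_cell(x_m, y_m, cell_size_m=1000):
    """Integer (column, row) index of the grid cell containing a projected point."""
    return (floor(x_m / cell_size_m), floor(y_m / cell_size_m))


def thin_to_grid(records, cell_size_m=1000):
    """Keep the first passing record per (species, grid cell)."""
    seen = set()
    kept = []
    for rec in records:
        if not keep_record(rec):
            continue
        key = (rec["species"], grid_cell(rec["x_m"], rec["y_m"], cell_size_m))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```

Keeping a single record per species and cell is one simple way to reduce spatial clustering of presences before model fitting; other thinning rules would fit the same description.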

Similar Papers
  • Peer Review Report
  • Citations: 25
  • 10.7554/elife.04395.017
Author response: Mapping the zoonotic niche of Ebola virus disease in Africa
  • Aug 28, 2014
  • David M Pigott + 18 more

Ebola virus disease (EVD) is a complex zoonosis that is highly virulent in humans. The largest recorded outbreak of EVD is ongoing in West Africa, outside of its previously reported and predicted niche. We assembled location data on all recorded zoonotic transmission to humans and Ebola virus infection in bats and primates (1976–2014). Using species distribution models, these occurrence data were paired with environmental covariates to predict a zoonotic transmission niche covering 22 countries across Central and West Africa. Vegetation, elevation, temperature, evapotranspiration, and suspected reservoir bat distributions define this relationship. At-risk areas are inhabited by 22 million people; however, the rarity of human outbreaks emphasises the very low probability of transmission to humans. Increasing population sizes and international connectivity by air since the first detection of EVD in 1976 suggest that the dynamics of human-to-human secondary transmission in contemporary outbreaks will be very different to those of the past. DOI: http://dx.doi.org/10.7554/eLife.04395.001

  • Research Article
  • Citations: 28
  • 10.3390/rs13071231
Earth Observation and Biodiversity Big Data for Forest Habitat Types Classification and Mapping
  • Mar 24, 2021
  • Remote Sensing
  • Emiliano Agrillo + 7 more

In light of the “Biological Diversity” concept, habitats are key units for the quantitative estimation of biodiversity at local and global scales. In Europe, EUNIS (European Nature Information System) is a tool for habitat identification and assessment. Earth Observation (EO) data, which are acquired by satellite sensors, offer new opportunities for environmental sciences and are revolutionizing the methodologies applied. They provide unprecedented insights for habitat monitoring and for evaluating the Sustainable Development Goals (SDGs) indicators. This paper shows the results of a novel approach for spatially explicit habitat mapping in Italy at a national scale, using a supervised machine learning model (SMLM) that combines a vegetation plot database (as response variable) with both spectral and environmental predictors. The procedure integrates forest habitat data in Italy from the European Vegetation Archive (EVA) with Sentinel-2 imagery processing (vegetation indices time series, spectral indices, and single bands spectral signals) and environmental data variables (i.e., climatic and topographic) to parameterize a Random Forests (RF) classifier. The obtained results classify 24 forest habitats according to the EUNIS III level: 12 broadleaved deciduous (T1), 4 broadleaved evergreen (T2), and 8 needleleaved forest habitats (T3), and achieved an overall accuracy of 87% at the EUNIS II level classes (T1, T2, T3), and an overall accuracy of 76.14% at the EUNIS III level. The highest overall accuracy value was obtained for the broadleaved evergreen forest, equal to 91%, followed by 76% and 68% for needleleaved and broadleaved deciduous habitat forests, respectively. The results of the proposed methodology open the way to increasing the number of EUNIS habitat categories mapped together with their geographical extent, and to testing different semi-supervised machine learning algorithms and ensemble modelling methods.

  • Dataset
  • 10.23708/rngs8z
Global maps of habitat suitability probability for 1,485 European endemic plant species
  • Oct 26, 2020
  • Robin Pouteau

This dataset includes 1,485 raster files (.gri format) representing global maps of habitat suitability probability for the most widespread European endemic plant species. 272 species are already recorded as naturalized outside Europe and 1,213 species are not yet recorded as naturalized outside Europe but might become so in the future depending on their habitat suitability probabilities. The spatial resolution is 0.4166667° × 0.4166667°. The geographic coordinate system is World Geodetic System 1984 (EPSG: 4326). To comprehensively describe the distribution of the species in Europe, we combined occurrence records from six sources: the ‘Global Biodiversity Information Facility’ (GBIF), the ‘European Vegetation Archive’ (EVA), the ‘EU-Forest’ dataset, the ‘Atlas Florae Europaeae’, the ‘Plant Functional Diversity of Grasslands’ network (DIVGRASS) and the digital atlas of the German flora. When several occurrence records from these different sources were duplicated on the same cell, only one occurrence record per species was kept to avoid pseudoreplication. We defined six environmental variables to model and project species expected ranges: annual mean temperature (°C), annual precipitation (mm), precipitation seasonality (yearly coefficient of variation) representing the period 1979-2013 provided by the CHELSA climate database, the percentage of each grid cell with primary land cover based on the Harmonized Global Land Use models, organic carbon content (g per kg) and pH in the first 15 cm of soil from the global gridded soil information database SoilGrids. Environmental variables were aggregated (using the mean value) to the resolution of 0.42° × 0.42°. Six species distribution modelling (SDM) methods including generalized additive models, generalized linear models, generalized boosting trees, maximum entropy, multivariate adaptive regression splines and random forest were used. 
To combine the predictive capability of the six SDMs, their projections were aggregated into a consensus projection. To ensure the quality of the ensemble SDM, we only kept the projections for which the accuracy estimated by AUC and TSS were higher than 0.8 and 0.6, respectively, and each SDM was weighted proportional to its TSS evaluation. The species distribution modelling workflow was performed within the ‘biomod2’ R platform.
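
The consensus step described above (discard projections that do not exceed the AUC and TSS thresholds, then weight the rest in proportion to TSS) can be sketched in a few lines. This is a hypothetical stand-alone illustration, not the biomod2 implementation; the function names and the data layout (one list of per-cell suitability values per model) are assumptions made for the example.

```python
def tss(tp, fp, fn, tn):
    """True skill statistic from a confusion matrix: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def weighted_consensus(models, min_auc=0.8, min_tss=0.6):
    """Combine per-cell predictions from several models.

    models: list of dicts with keys 'auc', 'tss' and 'prediction'
            (a list of suitability values, same length for every model).
    Models must exceed both thresholds; the rest are averaged with
    weights proportional to their TSS.
    """
    kept = [m for m in models if m["auc"] > min_auc and m["tss"] > min_tss]
    if not kept:
        raise ValueError("no model passed the accuracy thresholds")
    total_weight = sum(m["tss"] for m in kept)
    n_cells = len(kept[0]["prediction"])
    return [
        sum(m["tss"] * m["prediction"][i] for m in kept) / total_weight
        for i in range(n_cells)
    ]
```

Weighting by TSS rather than taking a plain mean lets better-performing algorithms contribute more to the consensus map while still retaining every model that passes the quality filter.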

  • Research Article
  • Citations: 5
  • 10.3390/environments4040081
Identifying Reliable Opportunistic Data for Species Distribution Modeling: A Benchmark Data Optimization Approach
  • Nov 14, 2017
  • Environments
  • Yu-Pin Lin + 4 more

The purpose of this study is to increase the number of species occurrence data by integrating opportunistic data with Global Biodiversity Information Facility (GBIF) benchmark data via a novel optimization technique. The optimization method utilizes Natural Language Processing (NLP) and a simulated annealing (SA) algorithm to maximize the average likelihood of species occurrence in maximum entropy presence-only species distribution models (SDM). We applied the Kruskal–Wallis test to assess the differences between the corresponding environmental variables and habitat suitability indices (HSI) among datasets, including data from GBIF, Facebook (FB), and data from optimally selected FB data. To quantify uncertainty in SDM predictions, and to quantify the efficacy of the proposed optimization procedure, we used a bootstrapping approach to generate 1000 subsets from five different datasets: (1) GBIF; (2) FB; (3) GBIF plus FB; (4) GBIF plus optimally selected FB; and (5) GBIF plus randomly selected FB. We compared the performance of simulated species distributions based on each of the above subsets via the area under the curve (AUC) of the receiver operating characteristic (ROC). We also performed correlation analysis between the average benchmark-based SDM outputs and the average dataset-based SDM outputs. Median AUCs of SDMs based on the dataset that combined benchmark GBIF data and optimally selected FB data were generally higher than the AUCs of other datasets, indicating the effectiveness of the optimization procedure. Our results suggest that the proposed approach increases the quality and quantity of data by effectively extracting opportunistic data from large unstructured datasets with respect to benchmark data.

  • Research Article
  • 10.3390/biology14111476
Predicting Habitat Suitability and Range Dynamics of Three Ecologically Important Fish in East Asian Waters Under Projected Climate Change
  • Oct 23, 2025
  • Biology
  • Ifeanyi Christopher Nneji + 8 more

Simple Summary: Climate change poses a significant threat to ecologically important fish species, underscoring the need to predict potential shifts in their distributions. Using ensemble species distribution models based on occurrence data from GBIF and OBIS, we assessed the current and future distributions of Collichthys lucidus, Konosirus punctatus, and Clupanodon thrissa in East Asia under present and future climate scenarios. Key environmental predictors were dissolved oxygen and salinity for C. lucidus and chlorophyll and phosphate for K. punctatus and C. thrissa. Projections indicated a contraction of suitable habitats for C. lucidus, in contrast to range expansions for K. punctatus and C. thrissa. Given the limited protection of these species by existing marine protected areas (MPAs), our findings highlight the urgent need for adaptive conservation strategies, including the expansion and reconfiguration of MPAs to safeguard future habitats.

The vulnerability of ecologically important fish species to climate change underscores the need to predict shifts in their distributions and habitat suitability under future climate scenarios. In this study, we modeled the potential distribution ranges of three ecologically important fish species (Collichthys lucidus, Konosirus punctatus, and Clupanodon thrissa) across East Asia using a species distribution modeling framework under both current and projected future climate scenarios. Occurrence data were obtained from the Global Biodiversity Information Facility (GBIF) and the Ocean Biodiversity Information System (OBIS), while environmental data were retrieved from the Bio-ORACLE database. Our models demonstrated high predictive performance (AUC > 0.88). Results showed that dissolved oxygen and salinity were the strongest bioclimatic predictors for C. lucidus, whereas chlorophyll and phosphate primarily shaped the distributions of K. punctatus and C. thrissa. Model projections indicated a decline in suitable habitats for C. lucidus, particularly under high-emission scenarios, and range expansions for K. punctatus and C. thrissa toward higher latitudes and nutrient-enriched waters. Highly suitable habitats were concentrated along coastlines within exclusive economic zones, exposing these species to significant anthropogenic pressures. Conservation gap analysis revealed that only 7%, 2%, and 6% of the distributional ranges of C. lucidus, C. thrissa, and K. punctatus, respectively, are currently encompassed by marine protected areas (MPAs). Our study further identified climatically stable regions that may act as climate refugia, particularly for C. lucidus in the Yellow and East China seas. Our findings highlight the urgent need for adaptive management, including the expansion and reconfiguration of MPAs, transboundary conservation initiatives, stronger regulation of exploitation, and increased public awareness to ensure the resilience of fisheries under future climate change.

  • Research Article
  • Citations: 3
  • 10.3897/biss.7.112957
Filling Gaps in Earthworm Digital Diversity in Northern Eurasia from Russian-language Literature
  • Sep 20, 2023
  • Biodiversity Information Science and Standards
  • Maxim Shashkov + 2 more

Data availability for certain groups of organisms (ecosystem engineers, invasive or protected species, etc.) is important for monitoring and making predictions in changing environments. One of the most promising directions for research on the impact of changes is species distribution modelling. Such technologies are highly dependent on occurrence data of high quality (Van Eupen et al. 2021). Earthworms (order Crassiclitellata) are a key group of organisms (Lavelle 2014), but their distribution around the globe is underrepresented in digital resources. Dozens of earthworm species, both widespread and endemic, inhabit the territory of Northern Eurasia (Perel 1979), but only very sparse data on them are available through global biodiversity repositories (Cameron 2018). There are two main obstacles to data mobilisation. Firstly, studies of the diversity of earthworms in Northern Eurasia have a long history (since the end of the nineteenth century) and were conducted by several generations of Soviet and Russian researchers. Most of the collected data have been published in "grey literature", now stored only in a few libraries. Until recently, most of these remained largely undigitised, and some are probably irretrievably lost. The second problem is the difference in the taxonomic checklists used by Soviet and European researchers. Not all species and synonyms are included in the GBIF (Global Biodiversity Information Facility) Backbone Taxonomy. As a result, existing earthworm species distribution models (Phillips 2019) potentially miss a significant amount of data, may underestimate biodiversity, and may predict distributions inaccurately. To fill this gap, we collected occurrence data from the Russian language literature (published by Soviet and Russian researchers) and digitised species checklists, keeping the original scientific names. 
To find relevant literature, we conducted a keyword search for "earthworms" and "Lumbricidae" through the Russian national scientific online library eLibrary and screened reference lists from the monographs of leading Soviet and Russian soil zoologist Tamara Perel (Vsevolodova-Perel 1997, Perel 1979). As a result, about 1,000 references were collected, of which 330 papers had titles indicating the potential to contain data on earthworm occurrences. Among these, 219 were found as PDF files or printed papers. For dataset compilation, 159 papers were used; the others had no exact location data or duplicated data contained in other papers. Most of the sources were peer-reviewed articles (Table 1). A reference list is available through Zenodo (Ivanova et al. 2023). The earliest publication we could find dates back to 1899, by Wilhelm Michaelsen. The most recent publication is 2023. About a third of the sources were written by systematists Iosif Malevich and Tamara Perel. Occurrence data were extracted and structured according to the Darwin Core standard (Wieczorek et al. 2012). During the data digitisation process, we tried to include as much primary information as possible. Only one tenth of the literature occurrences contained the geographic coordinates of locations provided by the authors. The remaining occurrences were manually georeferenced using the point-radius method (Wieczorek et al. 2010). The resulting occurrence dataset Earthworm occurrences from Russian-language literature (Shashkov et al. 2023) was published through the Global Biodiversity Information Facility portal. It contains 5304 occurrences of 117 species from 27 countries (Fig. 1). To improve the GBIF Backbone Taxonomy, we digitised two catalogues of earthworm species published for the USSR (Perel 1979) and Russian Federation (Vsevolodova-Perel 1997) by Tamara Perel. 
Based on these monographs, three checklist datasets were published through GBIF (Shashkov 2023b, 124 records; Shashkov 2023c, 87 records; Shashkov 2023a, 95 records). We are now working towards including these names in the GBIF Backbone so that all species names can be matched and recorded exactly as mentioned in papers published by Soviet and Russian researchers.

  • Research Article
  • 10.3897/biss.4.59154
Occurrence Cubes: A new way of aggregating heterogeneous species occurrence data
  • Sep 30, 2020
  • Biodiversity Information Science and Standards
  • Damiano Oldoni + 2 more

The digital era has brought about an impressive increase in the volume of published species occurrence data. Research infrastructures such as the Global Biodiversity Information Facility (GBIF), the digitization of legacy data, and the use of mobile applications have all played a role in this transition. More data implies, unavoidably, more heterogeneity at multiple levels as a result of the different methods and standards used to collect data. Data standardization and aggregation help to reduce this heterogeneity. Furthermore, intermediate data products that can be used for activities such as mapping, modeling and monitoring improve the repeatability and reproducibility of biodiversity research (Kissling et al. 2017). Occurrences can be defined as events in a three-dimensional space where the dimensions are taxonomic (what), temporal (when) and spatial (where). They are then aggregated into what we call an occurrence cube (Fig. 1). The taxonomic dimension is categorical. Research infrastructures like GBIF use a taxonomic backbone, thus making data aggregation at species level or higher rank relatively easy. The temporal dimension is a continuum and the temporal uncertainty is usually lower than the typical aggregation span, typically a year. Regarding the spatial dimension, occurrences are typically filtered to remove those with too large an uncertainty to fit the grid scheme being used, meaning that the spatial uncertainty is largely unused. We developed a method to take into account this spatial uncertainty while aggregating data. In particular, we state that an occurrence is spatially representable as a closed plane figure such as a circle, hexagon or square, never as the geometric centre (centroid) of it. As for GBIF occurrence data, the coordinateUncertaintyInMeters is defined as the radius describing the smallest circle containing the whole of the location (see Darwin Core standard). 
So, spatially speaking, we refer to occurrences as circles, even if the method described below is general. After harvesting the occurrence data and providing a data quality assessment (e.g. removing occurrences without coordinates or with suspicious coordinates) we can assign occurrences to a reference grid such as the European reference grid of the European Environment Agency (EEA) at 1 km scale. In this spatial aggregation we randomly choose a point within the occurrence circle and assign it to the grid cell in which it is contained. We can aggregate further by time (e.g. by year) and taxonomy (e.g. by species), where aggregating means counting how many occurrences are in each specific taxonomic-spatial-temporal unit. The analogy with geometry goes further: the occurrence cube can, as any cube, be projected on an orthogonal plane by aggregating along one of the three dimensions. In particular, projecting the cube on the taxonomic and temporal dimensions can be done by adding up the number of occurrences, or counting the number of occupied cells, thus estimating the area of occupancy. The occurrence cube paradigm has been developed within the Tracking Invasive Alien Species (TrIAS) project (Vanderhoeven et al. 2017) following Open Science and FAIR principles. We created and published occurrence cubes at the species level for Belgium and Italy (Oldoni et al. 2020b) and the occurrence cubes for non-native taxa in Belgium and Europe (Oldoni et al. 2020a).
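
The spatial aggregation described above (draw a random point inside each occurrence's uncertainty circle, assign it to a grid cell, then count occurrences per taxon, year, and cell) can be sketched as follows. This is a minimal illustration rather than the TrIAS implementation. The field names `x_m`, `y_m`, and `uncertainty_m`, the assumption that coordinates are already projected to metres, and the fixed random seed are all choices made for the example.

```python
import math
import random
from collections import Counter


def random_point_in_circle(x, y, radius, rng):
    """Uniform random point inside a circle (sqrt keeps density uniform over area)."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return x + r * math.cos(theta), y + r * math.sin(theta)


def occurrence_cube(occurrences, cell_size_m=1000, seed=0):
    """Count occurrences per (species, year, grid cell)."""
    rng = random.Random(seed)
    cube = Counter()
    for occ in occurrences:
        px, py = random_point_in_circle(
            occ["x_m"], occ["y_m"], occ["uncertainty_m"], rng
        )
        cell = (math.floor(px / cell_size_m), math.floor(py / cell_size_m))
        cube[(occ["species"], occ["year"], cell)] += 1
    return cube


def occupied_cells(cube, species):
    """Number of distinct cells occupied by a species across all years."""
    return len({cell for (sp, _year, cell) in cube if sp == species})
```

Summing the counts along the spatial dimension gives occurrences per species and year, while `occupied_cells` corresponds to the occupied-cell projection mentioned in the text as an estimate of area of occupancy.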

  • Research Article
  • 10.3390/insects16080769
Investigating the Spatial Biases and Temporal Trends in Insect Pollinator Occurrence Data on GBIF
  • Jul 26, 2025
  • Insects
  • Ehsan Rahimi + 1 more

Research in biogeography, ecology, and biodiversity hinges on the availability of comprehensive datasets that detail species distributions and environmental conditions. At the forefront of this endeavor is the Global Biodiversity Information Facility (GBIF). This study focuses on investigating spatial biases and temporal trends in insect pollinator occurrence data within the GBIF dataset, specifically focusing on three pivotal pollinator groups: bees, hoverflies, and butterflies. Addressing these gaps in GBIF data is essential for comprehensive analyses and informed pollinator conservation efforts. We obtained occurrence data from GBIF for seven bee families, six butterfly families, and the Syrphidae family of hoverflies in 2024. Spatial biases were addressed by eliminating duplicate records with identical latitude and longitude coordinates. Species richness was assessed for each family and country. Temporal trends were examined by tallying annual occurrence records for each pollinator family, and the diversity of data sources within GBIF was evaluated by quantifying unique data publishers. We identified initial occurrence counts of 4,922,390 for bees, 1,703,131 for hoverflies, and 31,700,696 for butterflies, with a substantial portion containing duplicate records. On average, 81.4% of bee data, 77.2% of hoverfly data, and 65.4% of butterfly data were removed post-duplicate elimination for dataset refinement. Our dataset encompassed 9286 unique bee species, 2574 hoverfly species, and 17,895 butterfly species. Our temporal analysis revealed a notable trend in data recording, with 80% of bee and butterfly data collected after 2022, and a similar threshold for hoverflies reached after 2023. The United States, Germany, the United Kingdom, and Sweden consistently emerged as the top countries for occurrence data across all three groups. The analysis of data publishers highlighted iNaturalist.org as a top contributor to bee data. 
Overall, we uncovered significant biases in the occurrence data of pollinators from GBIF. These biases pose substantial challenges for future research on pollinator ecology and biodiversity conservation.

  • Research Article
  • Citations: 6
  • 10.1038/s41598-024-76524-5
Predictions of species distributions based only on models estimating future climate change are not reliable
  • Oct 28, 2024
  • Scientific Reports
  • Spyros Tsiftsis + 3 more

Changes in climate and land use are the most often mentioned factors responsible for the current decline in species diversity. To reduce the effect of these factors, we need reliable predictions of future species distributions. This is usually done by utilizing species distribution models (SDMs) based on expected climate. Here we explore the accuracy of such projections: we use orchid (Orchidaceae) recordings and environmental (mainly climatic) data from the years 1901–1950 in SDMs to predict maps of potential species distributions in 1980–2014. This should enable us to compare the predictions of species distributions in 1980–2014, based on records of species distribution in the years 1901–1950, with real data in the 1980–2014 period. We found that the predictions of the SDMs often differ from reality in this experiment. The results clearly indicate that SDM predictions of future species distributions as a reaction to climate change must be treated with caution.

  • Research Article
  • Citations: 3
  • 10.3897/biss.3.37036
Going Molecular: Sequence-based spatiotemporal biodiversity evidence in GBIF
  • Jun 13, 2019
  • Biodiversity Information Science and Standards
  • Dmitry Schigel + 9 more

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion specimen occurrence records freely and openly available for use in research and policy making. These GBIF-mediated data range from vouchered museum specimens to observation records generated by humans and machines. New data are being generated from integrated remote sensing, ecological sampling, and molecular sequencing that have strong geospatial components but lack traditional vouchers. GBIF is working with partners to develop best practices for bringing these data into the GBIF architecture. Following discussions during the second Global Biodiversity Information Conference in 2018, GBIF and the European Bioinformatics Institute (EMBL-EBI), supported by ELIXIR, have extended collaboration to share species occurrence records known only from their genetic material. When these data providers contribute data coordinates along with the sequences to the European Nucleotide Archive (ENA), the records will appear on GBIF maps and in spatial searches. This collaboration enables significant new molecular data streams to become discoverable through GBIF.org: by mid-March 2019, over 7.8m individual occurrence records via the ENA, and over 13.2m records as standardized Darwin Core sampling-event datasets via MGnify, a resource that provides taxonomic and functional annotations on sequences derived from environmental sequencing projects. 
Sequence-based occurrence records published by ENA and MGnify boost representation of microbial diversity which was underrepresented at GBIF. The ELIXIR-ENA-MGnify-GBIF partnership is working on further refinement of the dynamic data linkages, frequency of updates and other improvements. The API-based tool that connects GBIF data infrastructures is open to new data contributors and for indexes of molecular occurrences. Indexing of these data streams is dependent on the presence of a name (any rank) with the sequence. Under the current Codes of nomenclature, animals, fungi, plants, and algae cannot be described based on exclusively sequence data. Yet, a significant volume of biodiversity data has only been represented by DNA sequences. Barcoding and sequence clustering procedures vary among taxa and research communities, but clusters can be related to a taxon with a Latin name. Many DNA similarity clusters do not contain a sequence from a formally described taxon; however these sequence clusters provide provisional molecular names for nomenclatural communication. In the best cases, curated libraries of reference sequences, their metadata, clusters, alignments, and links to individuals and physical material become de facto naming conventions for certain taxonomic groups, and co-exist with Latin names. Integration of molecular names into the taxonomic backbone of GBIF started with Fungi and UNITE, a data management and identification environment for fungal ITS barcodes with 87,000+ fungal species hypotheses demarcating 800,000+ sequence specimens as of March 2019. Checklist publication of all names in UNITE through GBIF.org including Linnaean names and stable, DOI-trackable molecular sequence based ‘species hypotheses’, enables indexing of fungal metabarcoding data worldwide, such as BIOWIDE. 
As names are currently essential to indexing the world’s occurrence data, GBIF will develop similar linkages with names in the Barcode of Life data system (BOLD) and in SILVA - a resource for high-quality ribosomal RNA sequence data and taxonomy, and welcomes other reference systems to this development. Expanding the molecular data streams (Fig. 1) allows GBIF to address spatial, temporal and taxonomic gaps and biases, and to support large-scale data-intensive research openly and worldwide.

  • Dataset
  • 10.34725/dvn/24818
Replication data for: Developing a Georeferenced Database of Selected Threatened Forest Tree Species in the Philippines
  • Aug 6, 2019
  • Lawrence Tolentino Ramos + 3 more

Georeferenced species occurrence is a prerequisite in species distribution modeling and species ecosystem correlation analysis and also aids in tracking plant species and prioritizing scarce resources for conservation. The Global Biodiversity Information Facility, legacy literature of biodiversity, contemporary literature, technical reports and biodiversity surveys are important sources of species occurrence data waiting to be georeferenced. In this paper, we discussed a method used to georeference occurrences of threatened forest tree species from the above sources. Locality descriptions were initially narrowed down in geographic information system using administrative maps and further confined using two criteria: 1) elevation and 2) surface cover information from remotely-sensed images. The result was a georeferenced database of 2,067 occurrence records of 47 threatened forest species on a national scale. Each record had a unique point feature per species and enough metadata directing the database user to the source of occurrence data. The database can be used as a tool in determining priority species for specimen or germplasm collection, for taxonomic identification and historical mapping. It also serves as an integral component in spatially modeling the distribution of tree species and forest formations in the past and in a possible future scenario.

  • Research Article
  • Cite Count: 1
  • 10.3897/biss.5.74052
An Image is Worth a Thousand Species: Scaling high-resolution plant biodiversity prediction to biome-level using citizen science data and remote sensing imagery
  • Sep 10, 2021
  • Biodiversity Information Science and Standards
  • Lauren Gillespie + 2 more

Accurately mapping biodiversity at high resolution across ecosystems has been a historically difficult task. One major hurdle to accurate biodiversity modeling is that there is a power law relationship between the abundance of different types of species in an environment, with few species being relatively abundant while many species are rarer. This “commonness of rarity,” confounded with differential detectability of species, can lead to misestimations of where a species lives. To overcome these confounding factors, many biodiversity models employ species distribution models (SDMs) to predict the full extent of where a species lives, using observations of where a species has been found, correlated with environmental variables. Most SDMs use bioclimatic environmental variables as predictors of a species’ range, but these approaches often rely on biased pseudo-absence generation methods and model species using coarse-grained bioclimatic variables with a useful resolution floor of 1 km per pixel. Here, we pair iNaturalist citizen science plant observations from the Global Biodiversity Information Facility with RGB-Infrared aerial imagery from the National Aerial Imagery Program to develop a deep convolutional neural network model that can predict the presence of nearly 2,500 plant species across California. We utilize a state-of-the-art multilabel image recognition model from the computer vision community, paired with a cutting-edge multilabel classification loss, which leads to accuracy comparable to or better than that of traditional SDMs, but at a resolution of 250 m (Ben-Baruch et al. 2020, Ridnik et al. 2020). Furthermore, this deep convolutional model is able to accurately predict species presence across multiple biomes of California and can be used to build a plant biodiversity map across California with unparalleled accuracy.
Given the widespread availability of citizen science observations and remote sensing imagery across the globe, this deep learning-enabled method could be deployed to automatically map biodiversity at large scales.

  • Research Article
  • Cite Count: 444
  • 10.1111/j.1365-2664.2007.01408.x
The influence of spatial errors in species occurrence data used in distribution models
  • Nov 2, 2007
  • Journal of Applied Ecology
  • Catherine H Graham + 5 more

Summary: Species distribution modelling is used increasingly in both applied and theoretical research to predict how species are distributed and to understand attributes of species’ environmental requirements. In species distribution modelling, various statistical methods are used that combine species occurrence data with environmental spatial data layers to predict the suitability of any site for that species. While the number of data sharing initiatives involving species’ occurrences in the scientific community has increased dramatically over the past few years, various data quality and methodological concerns related to using these data for species distribution modelling have not been addressed adequately. We evaluated how uncertainty in georeferences and associated locational error in occurrences influence species distribution modelling using two treatments: (1) a control treatment where models were calibrated with original, accurate data and (2) an error treatment where data were first degraded spatially to simulate locational error. To incorporate error into the coordinates, we moved each coordinate with a random number drawn from the normal distribution with a mean of zero and a standard deviation of 5 km. We evaluated the influence of error on the performance of 10 commonly used distributional modelling techniques applied to 40 species in four distinct geographical regions. Locational error in occurrences reduced model performance in three of these regions; relatively accurate predictions of species distributions were possible for most species, even with degraded occurrences. Two species distribution modelling techniques, boosted regression trees and maximum entropy, were the best performing models in the face of locational errors. The results obtained with boosted regression trees were only slightly degraded by errors in location, and the results obtained with the maximum entropy approach were not affected by such errors.
Synthesis and applications: To use the vast array of occurrence data that exists currently for research and management relating to the geographical ranges of species, modellers need to know the influence of locational error on model quality and whether some modelling techniques are particularly robust to error. We show that certain modelling techniques are particularly robust to a moderate level of locational error and that useful predictions of species distributions can be made even when occurrence data include some error.
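The error treatment described in this abstract, shifting each coordinate by a draw from a normal distribution with mean zero and a standard deviation of 5 km, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: points are taken to be in a projected coordinate system with units of metres, and the x and y shifts are drawn independently. The function name `degrade_coordinates` and the sample points are hypothetical; the abstract does not say how the authors implemented the step.

```python
import random


def degrade_coordinates(points, sd_m=5000.0, seed=None):
    """Return a copy of `points` with Gaussian locational error added.

    points : iterable of (x, y) tuples in metres (projected CRS assumed)
    sd_m   : standard deviation of the shift on each axis, in metres
    seed   : optional seed so the degraded set can be reproduced
    """
    rng = random.Random(seed)
    return [
        (x + rng.gauss(0.0, sd_m), y + rng.gauss(0.0, sd_m))
        for x, y in points
    ]


# Control treatment: original coordinates (hypothetical values)
control = [(500_000.0, 4_649_776.0), (512_300.0, 4_655_120.0)]

# Error treatment: same records, spatially degraded
degraded = degrade_coordinates(control, sd_m=5000.0, seed=42)
```

Calibrating the same model once on `control` and once on `degraded` then gives the paired comparison the abstract describes. Fixing the seed keeps the degraded set identical across modelling techniques.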

  • Research Article
  • Cite Count: 2
  • 10.1088/1755-1315/1133/1/012026
Smart farming: modeling distribution of Xanthomonas campestris pv. oryzae as a leaf blight-causing bacteria in rice plants
  • Jan 1, 2023
  • IOP Conference Series: Earth and Environmental Science
  • M H Saputra + 3 more

In agricultural plant protection, leaf blight is known to be caused by the bacterium Xanthomonas campestris pv. oryzae. This paper builds a species distribution model of this leaf blight-causing bacterium in rice plants and examines its habitat suitability throughout Indonesia in the context of climate change. Occurrence data were extracted from the Global Biodiversity Information Facility (GBIF), and climate data were obtained from the WorldClim data set. The species distribution model used the maximum entropy method available on the Ecocommons website. The results show that occurrences of the bacterium correlate positively with several climatic variables and that suitable habitat, presented in five classes, is spread throughout the archipelago. Main islands such as Java, Bali, and Sumatra share the areas with the highest suitability values, while Kalimantan and Sulawesi have only small areas of high suitability. Papua is less suitable for the bacterium to spread. Rice cultivation is inseparable from the threat of pests and diseases, which can cause losses ranging from decreased production to crop failure. A species distribution model is therefore needed to identify areas where the bacterium is likely to occur, so that mitigation or even prevention efforts can be made effectively.

  • Research Article
  • Cite Count: 2
  • 10.3897/biss.3.35829
Data Location Quality at GBIF
  • Jun 13, 2019
  • Biodiversity Information Science and Standards
  • John Waller

I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues, such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF is the largest open-data portal of biodiversity data, a large network of individual datasets (> 40k) from various sources and publishers. Since these datasets vary both internally and from dataset to dataset, this creates a challenge for users wanting to use data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious errors such as zero/impossible coordinates, empty or invalid data fields, and fuzzy taxon matching. Since GBIF primarily (but not exclusively) serves lat-lon location information, there is an expectation that occurrences fall somewhat close to where the species actually occurs. This is not always the case. Occurrence data can be hundreds of kilometers away from where the species naturally occurs, and there can be multiple reasons why this happens, which might not be entirely obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets are datasets that have low resolution due to equally spaced sampling. This can be a data quality issue because a user might assume an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where it occurs naturally. GBIF does not yet flag country centroids, which are records where the dataset publisher has entered the lat-lon center of a country instead of leaving the field blank.
I will discuss the challenges surrounding locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing DWCA terms like coordinateUncertaintyInMeters and footprintWKT are being utilized to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
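To make the two coordinate issues concrete, here is a minimal Python sketch of how gridded datasets and country-centroid records might be screened. It is a simple heuristic written for illustration, not GBIF's flagging procedure and not the algorithm used by the CoordinateCleaner R package. The function names, thresholds, and example coordinates are all hypothetical; only `decimalLatitude` and `decimalLongitude` are real Darwin Core terms.

```python
from collections import Counter


def dominant_spacing_share(values, decimals=4):
    """Share of gaps between consecutive unique values that equal the
    most common gap. Values near 1.0 suggest equally spaced sampling."""
    unique = sorted({round(v, decimals) for v in values})
    if len(unique) < 3:
        return 0.0
    gaps = [round(b - a, decimals) for a, b in zip(unique, unique[1:])]
    _, count = Counter(gaps).most_common(1)[0]
    return count / len(gaps)


def looks_gridded(records, min_share=0.9):
    """Flag a dataset whose latitudes and longitudes both sit on a
    regular lattice (threshold is an arbitrary illustrative choice)."""
    lats = [r["decimalLatitude"] for r in records]
    lons = [r["decimalLongitude"] for r in records]
    return (dominant_spacing_share(lats) >= min_share
            and dominant_spacing_share(lons) >= min_share)


def near_centroid(record, centroid, tol_deg=0.01):
    """True if a record lies within `tol_deg` degrees of a supplied
    (lat, lon) country centroid on both axes."""
    lat, lon = centroid
    return (abs(record["decimalLatitude"] - lat) <= tol_deg
            and abs(record["decimalLongitude"] - lon) <= tol_deg)


# Hypothetical dataset sampled on a 0.5-degree lattice
gridded = [
    {"decimalLatitude": 10.0 + 0.5 * i, "decimalLongitude": 20.0 + 0.5 * j}
    for i in range(5) for j in range(5)
]
# Hypothetical dataset with irregular field coordinates
irregular = [
    {"decimalLatitude": 10.0, "decimalLongitude": 20.31},
    {"decimalLatitude": 10.13, "decimalLongitude": 20.0},
    {"decimalLatitude": 10.9, "decimalLongitude": 21.77},
    {"decimalLatitude": 11.02, "decimalLongitude": 20.05},
    {"decimalLatitude": 12.5, "decimalLongitude": 23.4},
]
```

Calling `looks_gridded(gridded)` flags the lattice dataset, while `looks_gridded(irregular)` does not. A production check would also need to handle projected grids, rotated lattices, and datasets that mix gridded and precise records.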
