Abstract

Advances in computing, statistics, and technology over the past few decades have resulted in the accumulation of massive amounts of biodiversity data, as well as novel methods for using and integrating them (Miller et al. 2019). Data that have been collected for decades or even centuries can now be analyzed and applied in new ways. These data include alternative sources of information such as citizen science (also called community science) programs (Butcher et al. 1990, Sullivan et al. 2014, Hudson et al. 2017). In the midst of a global biodiversity crisis, these databases may hold the key to detecting important obstacles and threats to conservation, as well as determining the best interventions before it is too late. The ecological processes in question frequently occur across broad spatial and temporal scales; migratory species' ranges can span thousands of kilometers, and the impacts of threats such as climate change cannot necessarily be assessed on a local scale. Conventional methods of biodiversity monitoring may therefore be inadequate in the face of broad-scale change. The individual researcher can generally only collect data on a localized scale, yet with the proper data practices, this information can contribute to a better understanding of the bigger picture (Sutherland et al. 2009).

Data collected through biodiversity monitoring are a crucial component of solving any conservation problem. Information about the state of an ecological system can inform better decisions about protecting that system (Bennett et al. 2018). However, with biodiversity rapidly declining, monitoring can delay much-needed action and take up valuable and limited resources, and therefore may not always be the best option (Bennett et al. 2018, Buxton et al. 2020). Additionally, in many cases the data needed to answer pressing conservation questions already exist, and need only to be made more accessible or simply used.
The increasing availability of open data through large data repositories and alternative sources of data, such as citizen science programs, means that researchers should not have to compromise between fast action and informed action. Statistical methods of data integration can allow researchers to fill perceived spatial and temporal knowledge gaps using multiple existing datasets (Miller et al. 2019, Zipkin et al. 2019, Isaac et al. 2020), and analytical tools are available that can help quantify whether more data are needed (Canessa et al. 2015, Bennett et al. 2018).

In the symposium entitled “Minimizing Data Waste: Conservation in the Big Data Era,” we explored how using open and available data not only makes the best and most efficient use of limited resources, but can also lead to better conservation outcomes. We investigated how data integration helps improve our understanding of species trends and distributions, better evaluate ecological systems, and redistribute limited resources from monitoring to action. We emphasized why these advances in open data and data integration are critical not only for minimizing data waste, but also for conducting better conservation research and directly improving conservation management decisions.

Following are summaries of each of the four speakers' presentations. The first presentation (Knight) provided a definition of data integration and showed several examples of how integrating multiple datasets led to improved knowledge of common nighthawks (Chordeiles minor) across their full annual cycle. The second presentation (Dansereau) demonstrated how using large open datasets helped to reveal key insights into the distribution of ecological uniqueness and identify potential conservation targets over broad spatial scales. The third presentation (Momeni-Dehaghi) showed how available citizen science data for Monarch butterflies (Danaus plexippus) were used to map premigration distributions, with implications for estimating their mortality rates.
The final presentation (Binley) compared the outcomes of using freely available citizen science data versus paid professional surveys to prioritize conservation action, both in terms of costs and biodiversity. The final section of this report synthesizes these presentations, addresses some topics that were covered in the Q&A roundtable sessions, and provides an outlook on the future of conservation in the big data era. Key points from this symposium are summarized in Fig. 1.

The first presentation (Knight) demonstrated how data integration of existing monitoring datasets can improve our understanding of ecology and conservation needs. Data integration was defined broadly as the statistical combination of datasets collected under varying protocols (Miller et al. 2019, Isaac et al. 2020). Integration can be particularly useful for answering questions across large temporal and/or spatial extents and for poorly understood species because it can pool information from multiple sparse datasets. Many data integration approaches are centered around accounting for differences in the probability of detection and its two subcomponents: availability (the probability that an individual provides a cue for detection if it is present; sensu Marsh and Sinclair 1989) and perceptibility (the probability that the cue is detected if it is provided). These approaches use one or more submodels to estimate the probability of detection (e.g. MacKenzie et al. 2002, Sólymos et al. 2013, Miller et al. 2019). Knight presented several data integration examples for the common nighthawk, which is a declining, poorly understood, nocturnal, long-distance migratory bird whose distribution spans most of the western hemisphere (Brigham et al. 2020). First, general citizen science datasets were combined with more targeted datasets to understand population trends and habitat relationships (Knight et al. 2021b).
Next, various human observer datasets were integrated to calculate the probability of common nighthawk detection under various conditions (Sólymos et al. 2013, Edwards et al. 2022). Finally, those probabilities were used to integrate multiple monitoring datasets and evaluate several full annual cycle hypotheses for the causes of common nighthawk population declines. The full annual cycle analysis was informed by satellite tracking data (Knight et al. 2021a).

Data integration provided multiple benefits along the journey to understand common nighthawk population declines. First, integrating general monitoring datasets with targeted nocturnal monitoring increased the probability of detecting a 30% population decline from 38% to 69% (Knight et al. 2021b). Integrating the two monitoring programs also improved the predictive performance of species distribution modeling. Data integration for calculating probability of detection improved the temporal, spatial, and data type coverage of statistical offsets. Finally, data integration for understanding full annual cycle causes of population declines resulted in the detection of more significant population trends and in stronger results that were better aligned with common nighthawk ecology. Together, these examples emphasize the value of data integration for overcoming statistical hurdles, particularly for conservation research on data-sparse species.

In addition to large citizen science datasets such as eBird and the North American Breeding Bird Survey, the analyses used the Boreal Avian Modelling project database (Barker et al. 2015), the Canadian Nightjar Survey dataset (Knight et al. 2019), the Nightjar Survey Network dataset (Centre for Conservation Biology 2022), and the Common Nighthawk Migratory Connectivity Project (Knight et al. 2021a). Those datasets were accessed via several data portals, which also provide open access to many other citizen science and public datasets (Table 1).
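The split of detection probability into availability and perceptibility described above can be made concrete with a short sketch. It assumes a time-removal form for availability and a half-normal distance function for perceptibility, in the spirit of Sólymos et al. (2013). The function names and all parameter values are hypothetical and chosen only for illustration; they are not estimates from the analyses summarized here.

```python
import math


def availability(phi, minutes):
    """Probability that a bird present gives at least one cue during the survey.

    Assumes a constant cue rate `phi` (cues per minute), i.e. a removal-model form.
    """
    return 1.0 - math.exp(-phi * minutes)


def perceptibility(tau, radius):
    """Average probability that a cue is detected within a circle of `radius`.

    Assumes a half-normal distance function with effective detection
    radius `tau` (same distance units as `radius`).
    """
    ratio = (radius / tau) ** 2
    return (1.0 - math.exp(-ratio)) / ratio


def log_offset(phi, tau, minutes, radius):
    """Log of (area sampled x availability x perceptibility).

    Adding this offset to a log-linear count model puts surveys with
    different durations and radii on a common density scale.
    """
    area = math.pi * radius ** 2
    return (
        math.log(area)
        + math.log(availability(phi, minutes))
        + math.log(perceptibility(tau, radius))
    )


# Hypothetical values: two protocols that differ only in survey duration.
offset_short = log_offset(phi=0.3, tau=0.8, minutes=3, radius=1.0)
offset_long = log_offset(phi=0.3, tau=0.8, minutes=10, radius=1.0)

# The same raw count implies a higher density under the shorter protocol,
# because fewer of the birds present would have been available for detection.
count = 2
density_short = count / math.exp(offset_short)
density_long = count / math.exp(offset_long)
```

Because the offset absorbs protocol differences, counts from surveys of different lengths or radii can in principle enter one model, which is the sense in which detection submodels support integration.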
The data integration approaches used were based on the “QPAD” framework (Sólymos et al. 2013), a flexible framework that allows the user to estimate availability and perceptibility and use those estimates as statistical offsets in a wide range of modeling approaches. Models and estimates for over 100 boreal landbird species are available in the QPAD R package.

Open data hold an underexploited potential to improve ecological indicators used in conservation research. For instance, indicators such as ecological uniqueness, which can help identify areas with exceptional species composition and potential conservation targets, are often limited to local or regional scales because of sampling limitations for single studies. In contrast, the second presentation (Dansereau) showed how open data allow uniqueness to be measured over broad spatial extents and with new data types, while improving knowledge about the indicator itself. Based on Dansereau et al. (2022), this presentation showed that uniqueness could be predicted over broad spatial extents, revealing additional sources of variation in space. The authors used species distribution modeling to predict community composition across North America based on eBird citizen science data (Sullivan et al. 2014). Then, they measured ecological uniqueness using the local contributions to beta diversity (Legendre and De Cáceres 2013) and explored their spatial variation across different regions and scales. Uniqueness and its relationship to species richness changed according to the area under study and was affected by regional factors such as extent size, richness profile, and proportion of rare species. As a result, sites identified as unique may vary according to regional characteristics, which should be considered when this indicator is used for conservation recommendations. In addition, this presentation showed how open data could help extend uniqueness assessments to species interactions.
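The local contributions to beta diversity (LCBD) mentioned above can be illustrated with a minimal sketch. It assumes a Hellinger transformation of the site-by-species matrix, which is one of the options discussed by Legendre and De Cáceres (2013). The toy community matrix is hypothetical, and the code is not the authors' implementation.

```python
import math


def hellinger(matrix):
    """Hellinger-transform a site-by-species matrix (rows are sites)."""
    transformed = []
    for row in matrix:
        total = sum(row)
        transformed.append(
            [math.sqrt(value / total) if total > 0 else 0.0 for value in row]
        )
    return transformed


def lcbd(matrix):
    """Return (LCBD value per site, total beta diversity).

    Each site's LCBD is its share of the total sum of squared deviations
    from the species (column) means, so the LCBD values sum to 1.
    """
    data = hellinger(matrix)
    n_sites = len(data)
    n_species = len(data[0])
    means = [
        sum(data[i][j] for i in range(n_sites)) / n_sites
        for j in range(n_species)
    ]
    site_ss = [
        sum((data[i][j] - means[j]) ** 2 for j in range(n_species))
        for i in range(n_sites)
    ]
    ss_total = sum(site_ss)
    bd_total = ss_total / (n_sites - 1)
    return [ss / ss_total for ss in site_ss], bd_total


# Hypothetical presence-absence data: four sites, four species.
# The last site shares no species with the others.
communities = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
uniqueness, total_beta = lcbd(communities)
most_unique_site = uniqueness.index(max(uniqueness))
```

In this toy example the site with a species found nowhere else receives the largest LCBD value. Because each value is relative to the other sites in the matrix, changing the set of sites changes the result, which is consistent with the regional dependence reported in the presentation.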
The authors combined community predictions with an open interaction metaweb (Strydom et al. 2022) to produce localized predictions of ecological networks, then measured uniqueness separately based on interaction and community composition (Poisot et al. 2017). Interaction uniqueness showed a different spatial distribution from community uniqueness over whole regions, highlighting that sites and areas may be unique in one community aspect and not the other (e.g. unique communities without unique interaction networks). Therefore, considering different community components through open data sources can reveal additional important conservation targets, especially over broad spatial scales.

To conserve migratory species effectively, we need to know their distribution at different stages of their life cycle. For the Monarch butterfly (Danaus plexippus), although it is well known that they overwinter in oyamel fir (Abies religiosa) forests in the mountains west of Mexico City, the natal origins of these overwintering Monarchs are the only information we have about their premigration distribution (e.g. Flockhart et al. 2017). However, the premigration distribution and the natal origins of overwintering Monarchs (i.e. postmigration data) can be considered equivalent only if we assume that Monarchs originating from different regions have a similar mortality rate during their migration (Momeni-Dehaghi et al. 2021). In his talk, Momeni-Dehaghi demonstrated how the premigration map they developed using community science data can contribute to the conservation of Monarch butterflies (Momeni-Dehaghi et al. 2021). To estimate Monarchs' premigration distribution, the authors used data reported by citizen scientists in the Journey North program (http://www.journeynorth.org/) before Monarchs start their fall migration (i.e. before migration mortality), controlling for sampling bias. Momeni-Dehaghi compared the resulting distribution to distributions estimated using postmigration data (i.e.
isotopic-based natal origin assignments) to determine whether migration mortality varies between butterflies originating from different regions. This comparison suggests that Monarchs starting their migration from the North-central breeding region have a higher mortality rate than those from other regions. In contrast, those that originated from the Northwest and Southeast breeding regions have a lower mortality rate relative to other Monarchs. The premigration distribution map will be useful in future studies estimating the rates, distribution, and causes of mortality in migrating Monarchs. These results have clear implications for when and where conservation action should be prioritized for protecting this species. Given the broad, spatially complex ranges of migratory species such as the Monarch butterfly, large open datasets such as those collected through community science programs are essential to understanding the threats that face them throughout their full annual cycle.

An additional benefit of using freely available open data is that it serves as an effective means to collect vast quantities of data in a cost-efficient manner. This may prove invaluable to conservation efforts, as using crowd-sourced data to inform decisions can allow managers to redirect limited funds towards action rather than monitoring. The final presentation (Binley) demonstrated the quantitative benefit of using citizen science data for setting conservation priorities in an applied conservation setting. Using data from the BirdReturns conservation program implemented by The Nature Conservancy in central California, rice farms were prioritized for conservation action based on the modeled probability of detection of seven shorebird species using two datasets: eBird citizen science data, and monitoring data collected through professional surveys conducted by The Nature Conservancy (Robinson et al. 2020).
The value of these prioritizations was assessed using an integrated dataset that combined both the community and professionally collected data as a benchmark. Prioritizations conducted using the professional monitoring data were subject to a monitoring penalty, where the cost of monitoring was deducted from the overall budget, leaving less to pay for conservation action. Prioritizations were then run across a range of available budgets. The authors predicted that decisions based solely on eBird data would be preferable at lower budgets, given that more money can be spent on action rather than monitoring, but that the more targeted professional monitoring might provide better value at larger budgets. Contrary to those predictions, prioritizing detections across all seven species based on the model using eBird data resulted in the greatest overall value across all budgets. The difference was greatest at lower budgets, but prioritizations based on eBird data consistently performed better until the budget was large enough that all properties could be enrolled in the program. Furthermore, prioritizations based on citizen science data performed comparably to those based on the integrated model (i.e. the best available information). Even when the monitoring penalty was removed from the professional monitoring prioritization, allowing for a more direct comparison of information content, the eBird prioritization performed comparably to or better than professional monitoring. This demonstrates that, in this case study, eBird citizen science data matched or surpassed the capacity of professional monitoring data to inform conservation decisions. This presentation quantified the trade-offs between monitoring and action, to better illustrate to conservation managers the potential risks associated with unnecessary data collection. In this case study, there was no benefit to spending money on professional monitoring at any budget.
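The logic of the monitoring penalty can be sketched with a toy example. The greedy ranking below is an assumption made for illustration and is not the prioritization method used in the study. The site names, costs, predicted values, and benchmark values are all hypothetical.

```python
def prioritize(sites, budget, prediction_key, monitoring_cost=0.0):
    """Greedily select sites by predicted value per unit cost.

    `monitoring_cost` is deducted from the budget before any site is
    selected, which mimics the monitoring penalty described above.
    """
    remaining = budget - monitoring_cost
    ranked = sorted(
        sites, key=lambda site: site[prediction_key] / site["cost"], reverse=True
    )
    chosen = []
    for site in ranked:
        if site["cost"] <= remaining:
            chosen.append(site["name"])
            remaining -= site["cost"]
    return chosen


def benchmark_value(chosen, sites):
    """Score a selection against the benchmark (e.g. integrated) values."""
    return sum(site["benchmark"] for site in sites if site["name"] in chosen)


# Hypothetical sites; "free" and "paid" are predictions from freely
# available data and from professional monitoring, respectively.
sites = [
    {"name": "A", "cost": 10, "free": 0.8, "paid": 0.9, "benchmark": 0.85},
    {"name": "B", "cost": 10, "free": 0.6, "paid": 0.7, "benchmark": 0.65},
    {"name": "C", "cost": 10, "free": 0.3, "paid": 0.2, "benchmark": 0.25},
]

chosen_free = prioritize(sites, budget=20, prediction_key="free")
chosen_paid = prioritize(sites, budget=20, prediction_key="paid", monitoring_cost=10)

value_free = benchmark_value(chosen_free, sites)
value_paid = benchmark_value(chosen_paid, sites)
```

In this toy case both datasets rank the sites identically, so the whole difference in outcome comes from the budget lost to monitoring. Setting `monitoring_cost=0` isolates the information content of each dataset, mirroring the second comparison described above.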
Using openly and freely available data, resources can be redistributed towards actions that will directly benefit biodiversity, ultimately resulting in better overall outcomes.

This symposium presented a small subset of advances in ecology and conservation through the use of Big Data. While collecting data will always remain important in these fields, many pressing questions can be more immediately and satisfactorily answered using data already available. The selected speakers and talks presented here demonstrated how innovative and integrative approaches using Big Data are a necessary next step in the evolution of the field of conservation, not only in making the most efficient use of resources, but also for better understanding the ecological systems we are trying to protect. Each presentation explicitly showed how the knowledge obtained from Big Data integration represents an improvement over using locally sourced data alone. The first three talks provided an overview of how open data integration can enhance our understanding of several biodiversity metrics that are critical to making conservation decisions, and the last quantified the direct benefit of doing so, both in terms of dollars and biodiversity.

During the roundtable discussions, one topic brought up was the well-known issue of taxonomic biases in the collection of monitoring data (Mair and Ruete 2016, Troudet et al. 2017, Binley et al. 2023). Indeed, this was evident in the selection of speakers for this symposium. Although all speakers presented different applications and analyses of big data in ecology, three of the four speakers focused on birds and analyses using large bird-focused datasets. Whereas Knight, Dansereau, and Binley had access to millions of bird observations from large citizen science platforms (e.g. eBird and BBS), Momeni-Dehaghi had access to only one program with much less available data.
We agree that the limited availability of open data for under-sampled taxa poses a major obstacle to tackling broad-scale biodiversity issues across a range of taxa. For professionally collected data, we encourage researchers to continue making their data as open and available as possible. For example, entoGEM (Grames et al. 2022) is a collection of insect data, and the website WildTrax (https://wildtrax.ca) provides access to data collected by an array of environmental sensors such as autonomous recording units and camera traps, as well as professional point counts. We also note the rise in availability of citizen science data through programs such as eButterfly (Prudic et al. 2017), as well as the use of citizen science data in studying jellyfish blooms (Marambio et al. 2021), fisheries management (Bellquist et al. 2022), amphibian conservation (Lee et al. 2021), mosquito surveillance (Sousa et al. 2022), and threatened species monitoring (Soroye et al. 2022). We encourage the continued development of citizen science programs that target under-sampled taxa in creative and engaging ways.

Another topic of discussion was the availability of datasets and the tools used to access them. The speakers discussed some of the tools they used to access the data for their analyses (Table 1), but noted that there are likely several other tools beyond the scope of what was highlighted in these presentations. Given the increasing need to house large datasets in databases, the development of computational tools to access these data has become paramount, and we encourage owners and managers of large biological databases to continue developing open, easy-to-use tools that allow further access to these databases (Fortunato and Galassi 2021). Additionally, we encourage the continued movement to make data findable, accessible, interoperable, and reusable (i.e.
FAIR data; Wilkinson et al. 2016) to further facilitate data integration in ecology and evolution (O'Dea et al. 2021).

Finally, a major topic during the roundtable discussion was the use and reliability of citizen science data. Common criticisms of citizen science data are their taxonomic and spatial biases, as well as the potential lack of (or differences in) structure in data collection. Binley noted, however, that although these are fair criticisms of citizen science data, the same issues commonly arise in professionally collected data. Binley also noted that many large-scale citizen science platforms such as eBird have thought extensively about controlling for these biases (Johnston et al. 2021), and so researchers who are hesitant about using alternative data sources should take some comfort in knowing that many of the expected biases have been accounted for (Ellwood et al. 2017).

The objective of this symposium was to demonstrate examples of large, open biodiversity databases that already exist, as well as methods for working with and integrating these datasets to make the best use of available information. It is well known that we are in a global biodiversity crisis, and so we must act fast to understand the current state of species' ecologies, including population trends, distributions, and causes of excess mortality, and use this knowledge to implement conservation action for the species that need it most. This is particularly important for modern conservation research, which continues to face limited conservation budgets yet allocates over half of the budget to monitoring alone (Buxton et al. 2020). By making use of data already available, conservation managers and practitioners can look to implement conservation actions sooner rather than later, to avoid the situation of species being “monitored to death” (Lindenmayer et al. 2013).
The concepts explored during this session are of general interest and applicability to anyone in the field of ecology or conservation, particularly those conducting work on broad spatial and temporal scales. Given the vast investment of time and resources that has already gone into collecting these data, and the time-sensitive nature of conservation research, it is vital that researchers familiarize themselves not only with what data are already available, but also with the methods required to make use of them. In a world where the volume of biodiversity data continues to rise, yet biodiversity itself continues to decline, the ability to make use of open and available datasets to inform future conservation decisions in a timely manner will play a critical role in preventing further extinctions.

A. Binley and B. Edwards contributed equally to this manuscript.
