Abstract

Geobiology explores how Earth's system has changed over the course of geologic history and how living organisms on this planet are impacted by or are indeed causing these changes. For decades, geologists, paleontologists, and geochemists have generated data to investigate these topics. Foundational efforts in sedimentary geochemistry utilized spreadsheets for data storage and analysis, suitable for several thousand samples, but not practical or scalable for larger, more complex datasets. As results have accumulated, researchers have increasingly gravitated toward larger compilations and statistical tools. New data frameworks have become necessary to handle larger sample sets and encourage more sophisticated or even standardized statistical analyses. In this paper, we describe the Sedimentary Geochemistry and Paleoenvironments Project (SGP; Figure 1), which is an open, community-oriented, database-driven research consortium. The goals of SGP are to (1) create a relational database tailored to the needs of the deep-time (millions to billions of years) sedimentary geochemical research community, including assembling and curating published and associated unpublished data; (2) create a website where data can be retrieved in a flexible way; and (3) build a collaborative consortium where researchers are incentivized to contribute data by giving them priority access and the opportunity to work on exciting questions in group papers. Finally, and more idealistically, the goal was to establish a culture of modern data management and data analysis in sedimentary geochemistry. Relative to many other fields, the main emphasis in our field has been on instrument measurement of sedimentary geochemical data rather than data analysis (compared with fields like ecology, for instance, where the post-experiment ANOVA (analysis of variance) is customary). 
Thus, the longer-term goal was to build a collaborative environment where geobiologists and geologists can work and learn together to assess changes in geochemical signatures through Earth history. With respect to the data product, SGP is focused on assembling a well-vetted and comprehensive dataset that is tractable to multivariate statistical analyses accounting for multiple geological and methodological biases. Phase 1 of the project, which focused on the Neoproterozoic and Paleozoic, has been completed. Future phases will capture a broader range of geologic time, data types, and geography. The database contains tens of thousands of unpublished data points provided by consortium members, as well as detailed metadata that go beyond what is contained in papers. In many cases, these represent measurements that are tangential to a given published study but still of high utility to database studies; these allow the community to address questions that would be impossible to answer solely with the published data. For instance, in order to use a proxy such as Mo/TOC (total organic carbon) ratios in mudrocks deposited under a euxinic water column, the full suite of trace metal, iron speciation, and total organic carbon data is needed. Likewise, geospatial information is required to account for sampling biases, and many statistical learning approaches cannot accept, or have difficulty with, incomplete geological predictor variables. Ultimately, it is this complete data matrix that will allow for SGP’s most insightful analyses. This paper serves as an introduction to SGP, the process by which our data products are created, a description of the Phase 1 data product and a citable reference for that product, a description of the SGP website and API (Application Programming Interface) for open access, and a statement of our future goals. 
In recent years, there has been a welcome trend in the broader geochemical community toward increased data accessibility, documentation of sample context, and sample curation, albeit with challenges still ahead (Brantley et al., 2020; Cutcher-Gershenfeld et al., 2016; Planavsky et al., 2020). First, progress has been made through journals and organizations adopting stringent data archiving rules and promoting adherence to FAIR principles—findability, accessibility, interoperability, and reusability (“FAIR Play in Geoscience Data,” 2019; Wilkinson et al., 2016). Second, several databases now house geochemical data at different scales and with different focuses (Brantley et al., 2020; Gard et al., 2019; He et al., 2019; Lehnert et al., 2000). Among the largest and most active are projects such as EarthChem (earthchem.org), the Geobiodiversity Database (geobiodiversity.com), Pangaea (https://www.pangaea.de), and the StabisoDB (https://cnidaria.nat.uni-erlangen.de/stabisodb/). The SGP database was built with the data structures and standards of these other projects in mind, in keeping with FAIR principles and with the hope that data can be easily shared in the future. Consistent with the stance taken by other organizations in the community (Hanson, 2016), we also strongly encourage all members to register their samples for an International Geo Sample Number (IGSN; i.e., globally unique alphanumeric sample identifiers), which can be obtained from the System for Earth Sample Registration (www.geosamples.org). However, SGP is a domain-specific project that differs from other databases in the way the data are collected, the nature of the data collected, and the tailored way in which they are presented to our research community. 
Although some other databases contain sedimentary geochemical data, the vast majority of deep-time data is not available from any single source, and samples are not readily associated with critical contextual data—such as age constraints and environmental data—necessary for the types of proxy-through-time and/or environmental studies typically conducted in historical geobiology. When the SGP was founded in 2015, we believed that a “team science” philosophy would be the most effective way to move beyond spreadsheets to the type and abundance of data required. The research consortium framework we have implemented is modeled after mature consortia in human statistical genetics, such as the Psychiatric Genomics Consortium (PGC). In the PGC, researchers have aggregated data to make statistically robust observations and landmark findings not possible with the data generated by any single research group alone (Duncan et al., 2017; Schizophrenia Working group of the Psychiatric Genomics Consortium, 2014; Wray et al., 2018). Similar to biomedical research consortia, we hope that the intellectual and collaborative environment fostered by SGP will ultimately be as important as our data products or specific insights in research papers. The first priority for Phase 1 of SGP was to assemble or generate multi-proxy sedimentary geochemical data (carbon and sulfur abundances and isotopes, iron speciation, major and trace metal abundances, and trace metal isotopes, primarily from fine-grained siliciclastic rocks) from multiple regions worldwide for every Paleozoic Epoch and equivalent ~25 Myr Neoproterozoic time slice. In addition to data compilation, this has involved an effort by SGP members to generate new geochemical data from “background” intervals in the Paleozoic (i.e., not associated with events such as mass extinctions or significant climatic shifts). The first phase of data collection came to an end in 2019. 
At that point, a copy of the database was vetted by SGP team members and then archived, creating the first data “freeze” (following the best-practices approach used in medical consortia). Working groups were formed (with working group leadership established through an open call to SGP team members), and data were made available to working group analysts via the website and through tailored queries. The first working group papers have recently been published (LeRoy et al., 2021; Lipp et al., 2021; Mehra et al., 2021), and more are in progress. Meanwhile, data collection continues, and the Phase 2 goal is to include more Mesozoic–Cenozoic and pre-Neoproterozoic time intervals and to expand the geochemical record to more diverse lithologies and grain-specific phases. The Phase 2 data freeze is currently anticipated for 2023, followed by data vetting and analyses toward group papers.
SGP utilizes a relational database implemented with the PostgreSQL database management system. A full database diagram and documentation are available at https://github.com/ufarrell/sgp_phase1, and a simplified diagram is shown in Figure 2. The design was inspired by several existing data models in the geological and natural history museum communities. Tables for analytical geochemistry are from the British Geological Survey (BGS) geochemistry data model (Watson et al., 2014), with minor modifications. Tables for geological, geographical, and sample details are based on established museum collection management databases (Specify 6 https://www.specifysoftware.org/ and Arctos https://arctosdb.org/) in addition to the Observations Data Model 2 (ODM2, Horsburgh et al., 2016; Hsu et al., 2017), an information model for Earth observations. The SGP database is centered on the sample table (Figure 2). Samples are generally characterized by an individual rock sample and all resulting analyzed powders.
The three key sections of the database linked to samples are (1) analytical results and associated methods, (2) geographical context, and (3) geological context. Dictionary tables (standardized lists of terms, also known as “controlled vocabularies”) are based on existing community vocabularies where possible (e.g., from EarthChem, ODM2, Macrostrat, U.S. Geological Survey (USGS), and BGS). However, in many cases, these vocabularies required additions, such as the inclusion of specific sedimentary geochemical experimental methods (e.g., sequential iron extraction techniques; Poulton & Canfield, 2005). The BGS data model for analytical methods and geochemical results has been adopted almost without modification. We store analytical data in their submitted or published format and do not standardize the results to any given unit. An analytical result may be empty (NULL) only if it is below or above detection limits, and those values are also stored if they are available. If the results are published, they are linked directly to a reference work on an individual basis so that a fine-level distinction can be made between published and related unpublished data from the same samples. Any geostandards that are analyzed alongside samples in a study are also recorded. In the SGP, we make every effort not to include the same result twice. However, replicates may legitimately be added if the same sample has undergone analysis for the same analyte more than once (this could include anything from true replicate analyses using the same methods in the same laboratory to analyses of the same sample by different research groups using different methods). We do not currently assign new sample identifiers to sub-samples. A parent–child relationship may be added in Phase 2 when the focus will expand to include carbonate data. The SGP welcomes contributions from any interested researchers. 
Specifically, contributing data automatically makes a researcher part of the SGP Collaborative Team, rather than one needing to “join” SGP to contribute data. In the first consortium-building stage, potential collaborators were targeted if their work was particularly relevant to the Phase 1 goals, and additional researchers were recruited via SGP representation at multiple conferences. SGP collaborators are involved in providing details about their samples and providing published data tables and unpublished data from their own archives. In addition, some data have been collected from relevant published studies where the authors are not directly involved. In such cases, contextual information was coded by SGP team members using information provided in the paper. SGP collaborators are asked to fill in a template with contextual information as completely as possible, but with an emphasis on key fields such as modern latitude and longitude, stratigraphic unit name, depositional environment, and lithology. A particularly important field is interpreted age, which is a numerical estimate for the age of each sample in millions of years (Ma). Whenever possible, the original authors, who are most familiar with the samples and stratigraphic sections, are asked to provide the interpreted age. They can use whatever method with which they feel most comfortable; for example, ages may be estimated based on assumed sedimentation rates and/or linear interpolation, or groups of samples can be assigned one age based on proximity to any available time markers. A brief justification is required for each age provided, which may be used in the future to refine ages further. Maximum and minimum age estimates can also be stored, and indeed, are critical for the type of re-weighted bootstrap analyses employed by many SGP working groups (Mehra et al., 2021). A subset of samples from two USGS databases has been integrated into the SGP database. 
The first of the databases used is the National Geochemical Database: Rock (USGS NGDB, U.S. Geological Survey, 2008), comprising data from USGS projects from the 1960s to 1990s, largely from North America. The second is the Global Geochemical Database for Critical Metals in Black Shales project (USGS CMIBS, Granitto et al., 2017), which includes predominantly Phanerozoic shale data from all continents. Data from both USGS databases lack much of the contextual information available for samples directly coded by the SGP team members (most specifically basin type, metamorphic/maturity grade, depositional environment, and detailed age justification), and there is a higher proportion of analytes with less detailed geochemical methodology. Nevertheless, they represent large numbers of samples (74% of samples in Phase 1 are from USGS sources) with age, lithology, and geographic information that can be utilized for many types of analysis. In the case of USGS NGDB, only sedimentary samples were incorporated into SGP, and in the case of USGS CMIBS, we did not include samples with lithologies indicative of ore or studies where the authors were primarily concerned with mineral deposits or studying the effects of metamorphism on shales. An attempt was made to match USGS fields to SGP fields, with some data cleaning needed in order to extract important information such as up-to-date stratigraphic names. Samples can easily be traced back to the original USGS databases using their original identifiers. The USGS NGDB data were enhanced by adding interpreted ages. Samples were matched, using a combination of stratigraphy and location, to the continuous-time age model in Macrostrat (Peters et al., 2018). Specifically, the minimum and maximum age estimates from the Macrostrat model were entered, and the interpreted age was entered as the average of these values. Only samples with matched interpreted ages were included from USGS NGDB.
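As a minimal sketch of the age assignment just described for USGS NGDB samples, the interpreted age can be computed as the midpoint of the matched minimum and maximum ages. The function name and the example bounds below are hypothetical; they are not taken from the SGP codebase or from Macrostrat.

```python
def interpreted_age_from_bounds(min_age_ma: float, max_age_ma: float) -> float:
    """Return the midpoint (in Ma) of a minimum and maximum age estimate.

    Hypothetical helper: mirrors the rule stated in the text, where the
    interpreted age is the average of the minimum and maximum ages.
    """
    if min_age_ma > max_age_ma:
        raise ValueError("minimum age must not exceed maximum age")
    return (min_age_ma + max_age_ma) / 2


# Made-up bounds for illustration only (ages in millions of years).
example_age = interpreted_age_from_bounds(440.0, 460.0)
print(example_age)
```

A sample with no matched bounds has no midpoint to compute, which is consistent with the rule that only samples with matched interpreted ages were included.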
The USGS CMIBS samples were associated with Macrostrat continuous-time age models where possible and given age information by SGP team members where not. However, a proportion (36%) remain without ages, and filling those in is a key goal for Phase 2. These three sources of data (direct entry by SGP team members (26% of samples), the CMIBS compilation (16% of samples), and the USGS NGDB (58% of samples)) provide a robust base platform for statistical analyses of aggregated sedimentary geochemical data through Earth history. Moving forward, we will continue direct entry from SGP team members, and work toward incorporating geochemical data compiled by additional geological surveys (for instance, incorporation of the OZCHEM whole-rock database from Geoscience Australia is currently in progress). Phase 1 of data collection ended in August 2019. A static version of the database was archived and made available to collaborators through the website (sgp-search.io) and via tailored queries. Time was allowed for vetting, and any errors discovered were corrected before the final freeze in February 2020. The Phase 1 data freeze includes 82,578 samples, with 2,701,236 analytical results, and was made public through our search website in December 2020. This paper should be cited in the future use of Phase 1 data downloads. More complete information on the Phase 1 data product can be found on the SGP wiki (https://github.com/ufarrell/sgp_phase1/wiki), including summaries by age, lithology, and geochemical methodology, as well as the specifics of how USGS databases were incorporated into the SGP structure. The SGP-contributed dataset includes 20,811 samples with 518,291 results. Approximately two thirds of the data (64%) come from 160 published sources (https://github.com/ufarrell/sgp_phase1/wiki/SGP-data-references). The remaining 36% are from unpublished sources, including new and legacy data. The samples come from 942 individual sites from 46 countries (Figure 3). 
Consistent with the Phase 1 goals, 84% of samples were from the Neoproterozoic–Paleozoic (Figure 4). Sixty-four percent of samples are fine-grained siliciclastic rocks (shale, mudstone, or siltstone), as are the majority of uncoded lithologies (Figure 5). The data from USGS NGDB that are incorporated into the SGP database include 48,234 samples with 1,769,696 results. Nearly all (99%) of the samples are from the United States. Nineteen percent are sandstone, 13% are shale, and 29% do not have a specific lithology (although lithological details may be available in verbatim fields; Figure 5). Contextual details, including depositional environment and low-grade metamorphic bin, are mostly not available for these samples, and methodological information is sparse. In general, the USGS NGDB samples skew younger than the SGP samples: 39% are from the Paleozoic, 25% from the Mesozoic, and 33% from the Cenozoic (~3% of samples are from the Proterozoic/Archean). The USGS database provides excellent coverage of the United States, but given the remit of the organization, with strong focus on economic deposits (petroleum-producing units, phosphatic units, and sedimentary mineral deposits), the sampling may not be representative of the entire country. This is distinct from the bias present in geochemical data produced by academic researchers, which are often focused on mass extinction intervals, Earth system perturbations, and other stratigraphic boundaries. The data incorporated from USGS CMIBS into the SGP database include 12,797 samples with 409,188 results. The samples are from 45 countries, with 40% from Canada, 27% from the United States, and 13% from Australia. The majority of samples are fine-grained siliciclastic sediments (69% shale, mudstone, siltstone, or argillite; Figure 5). Sixty percent of samples with interpreted ages are Paleozoic, 24% are Mesozoic, 2% are Cenozoic, and 15% are Proterozoic/Archean. 
As was the case for USGS NGDB, contextual details, including depositional environment and low-grade metamorphic bin, are often missing for these samples. However, more detailed geochemical methodological information is available. Each sample in CMIBS has a “best value” result per analyte, selected from multiple values that were originally available (Granitto et al., 2017). The choice of “best value” was made using a rubric which included consideration of the sample weight, the sample “decomposition” (e.g., full vs. partial acid digestion), the instruments used in the analysis, and the detection limits (Granitto et al., 2013). The SGP search website (sgp-search.io) utilizes an intuitive user interface to query the Phase 1 database via an API. The two main search types are “samples” and “analyses,” with “nhhxrf” simply being a “samples” search that excludes any handheld XRF (X-ray fluorescence) data. This methodological distinction is made because while handheld XRF data can be accurate for some elements (e.g., Ca and Fe), it is highly inaccurate for many others (e.g., S, Ni) (Rowe et al., 2012). Handheld XRF data represent 1% of the total results and 4% of SGP-contributed data; although this is a small percentage now, we anticipate continued growth given the popularity and utility of handheld XRFs. A “samples” search will list an individual sample on each row, with geological context information and geochemical analytes taking up the columns. Data are converted to one standard unit, and oxides are converted to elements (e.g., Al2O3 to Al), and values are averaged if more than one analysis was made per sample. Note, this search may average values produced using different analytical methods, although the number of samples in the database with multiple analytical values for a specific analyte is relatively small. Further, any analyses below or above detection limit are removed, as these cannot be averaged. 
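The processing behind a “samples” search can be illustrated with a short sketch, offered under stated assumptions rather than as the website's actual implementation. The record layout, the function names, and the choice of ppm as the common unit are all hypothetical; the Al2O3-to-Al factor is derived from standard atomic masses.

```python
from statistics import mean

# Mass fraction of Al in Al2O3, from standard atomic masses (Al 26.982, O 15.999).
AL_IN_AL2O3 = (2 * 26.982) / (2 * 26.982 + 3 * 15.999)

# Assumed common unit for this sketch: ppm (1 wt% = 10,000 ppm).
PPM_PER_UNIT = {"ppm": 1.0, "wt%": 10_000.0}


def to_element_ppm(analyte, value, unit):
    """Convert one result to (element, ppm), turning Al2O3 into Al."""
    ppm = value * PPM_PER_UNIT[unit]
    if analyte == "Al2O3":
        return "Al", ppm * AL_IN_AL2O3
    return analyte, ppm


def summarize_sample(results):
    """Average all usable results per element for a single sample.

    `results` is a list of dicts with keys analyte, value, unit; a value of
    None stands for a result below or above detection limits.
    """
    grouped = {}
    for r in results:
        if r["value"] is None:  # censored result: cannot be averaged
            continue
        element, ppm = to_element_ppm(r["analyte"], r["value"], r["unit"])
        grouped.setdefault(element, []).append(ppm)
    return {element: mean(values) for element, values in grouped.items()}


# Made-up results for one sample.
results = [
    {"analyte": "Al2O3", "value": 10.0, "unit": "wt%"},
    {"analyte": "Al", "value": 55_000.0, "unit": "ppm"},
    {"analyte": "Ag", "value": None, "unit": "ppm"},  # below detection
]
summary = summarize_sample(results)
print(summary)
```

In this sketch the two Al determinations are averaged into a single value, while the censored Ag result is dropped, so Ag does not appear in the output row at all.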
This has implications for queries involving very low abundance elements (e.g., Ag in sedimentary rocks), as only results above detection limits, and thus higher values, will be included. We anticipate that this search will produce the optimal data output for most end-users interested in Earth history: a file with age, geological context, and geochemical data for each sample. If users are looking to delve deeper into the data and understand the analyses and procedures that were executed to obtain each sample's geochemical data, then the “analyses” search is useful because it lists every analysis recorded in the database in a separate row. The “analyses” search also allows users to show data relating to the laboratory where the sample was analyzed, the person who made the measurement, geochemical methodology, etc. At the current time, aside from the ability to exclude handheld XRF data, the “samples” and “nhhxrf” search types will not report information about, or have the ability to filter by, geochemical methodology. Users who are interested in methodological details or who would like to export a data file beyond the size limit (10 Mb) should contact the SGP Leadership Team regarding a custom SQL query. Once the user has selected a search type, samples can be filtered based on both geological context and geochemical attributes. Note that for many samples some aspects of geological contextual information are incomplete. Thus, for example, a search filtering for samples deposited in a rift basin will only return samples positively described as such and not necessarily all samples in the database deposited in rift basins. Given that samples will have non-overlapping missing data, too many filters may result in a smaller-than-expected dataset. Search results will appear in a “preview” window that can be used to check the output. Each sample also has an information icon associated with it; clicking this icon will bring up a lightbox with detailed sample information. 
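The effect of incomplete contextual information on filtering can be shown with a small sketch. The field names and records are invented for illustration and do not reflect the SGP schema; the point is only that a sample with a missing value never matches a filter on that field.

```python
# Made-up samples; None marks contextual information that was never coded.
samples = [
    {"id": "A", "basin_type": "rift", "lithology": "shale"},
    {"id": "B", "basin_type": None, "lithology": "shale"},
    {"id": "C", "basin_type": "rift", "lithology": None},
    {"id": "D", "basin_type": "foreland", "lithology": "shale"},
]


def apply_filters(rows, **criteria):
    """Keep rows whose fields are positively recorded as the requested values."""
    return [
        row for row in rows
        if all(row.get(field) == wanted for field, wanted in criteria.items())
    ]


rift = apply_filters(samples, basin_type="rift")
rift_shale = apply_filters(samples, basin_type="rift", lithology="shale")

print([row["id"] for row in rift])
print([row["id"] for row in rift_shale])
```

Sample B might well come from a rift basin, but it is excluded because its basin type was never recorded, and adding the lithology filter also removes sample C for the same reason.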
Finally, the user may request to show reference information for their search. For “analyses” searches (where every analysis is shown as an individual row), this will return the specific literature citation for that individual analytic result. For other search types, this will return, for every sample, a concatenated list of all references whose geochemical data contributed to that specific search. When the user is satisfied with their search, they can then download a .csv file of the data and export a map showing the location and age of samples in their search.
An example API call would be {"type":"samples","filters":{"country":["Argentina","Brazil","Chile","Bolivia","Colombia","Venezuela"],"toc":[2,100]},"show":["toc","fe","height_meters","section_name","country","interpreted_age"]}. This API call makes a “samples” type search for samples that originate from Argentina, Brazil, Chile, Bolivia, Colombia, or Venezuela and have 2%–100% total organic carbon (TOC) content. In other words, it searches for organic-rich samples from South America. In addition, the API call asks for a results output table with columns that show TOC (wt%), Fe (wt%), section or core name, collection height in meters, each sample's country, and the age in millions of years. Full documentation and a tutorial video are available on the website.
The overarching goal of SGP was to provide intellectual and geoinformatic resources for the Earth Science community to advance our understanding of environmental changes on Earth through time. A better understanding of Earth's history requires sufficient data density, but equally importantly it means training a new generation of researchers with the data science and statistical skills to make meaningful conclusions from large sedimentary geochemical datasets. Much of the focus in SGP Phase 1 was in initiating the consortium and increasing the data product to the point where it was useful for analyses by the community.
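To make the example API call above concrete, the following sketch assembles the same query body programmatically. It only builds and serializes the JSON; how the body is submitted (endpoint, HTTP method, headers) is not described here and is deliberately left out, so the website documentation should be consulted for that. The helper name `build_sgp_query` is hypothetical.

```python
import json

# Search types named in the text.
SEARCH_TYPES = {"samples", "analyses", "nhhxrf"}


def build_sgp_query(search_type, filters, show):
    """Assemble a query body in the shape of the example call in the text."""
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown search type: {search_type!r}")
    return {"type": search_type, "filters": filters, "show": show}


query = build_sgp_query(
    "samples",
    filters={
        "country": ["Argentina", "Brazil", "Chile",
                    "Bolivia", "Colombia", "Venezuela"],
        "toc": [2, 100],  # total organic carbon range, in wt%
    },
    show=["toc", "fe", "height_meters", "section_name",
          "country", "interpreted_age"],
)

# Compact separators reproduce the whitespace-free form used in the text.
payload = json.dumps(query, separators=(",", ":"))
print(payload)
```

Building the query as a dictionary first makes it easy to change one filter or output column at a time without hand-editing a long JSON string.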
We now aim to increasingly move toward developing a community-initiated set of best practices for data management, a culture of publishing metadata, and a shared intellectual framework for analyzing such datasets. Over the course of Phase 2, we plan to continue holding annual meetings at Goldschmidt while also beginning regular video calls to share progress and ideas for data analysis. We will also develop accessible "Proxy Primer" videos to help the geobiological community understand the strengths and weaknesses of different proxies. Echoing this final point, we reiterate that the SGP is a community-oriented research consortium, and we welcome suggestions on how to best move toward our shared goals. We thank Sufian Lattouf for developing the initial version of the SGP website, and Kai Lenz, Kassie Sharp, Aaron Cole, Clare Swan, Lyna Kim, and John Freshwaters for computational assistance. We thank Erin Saupe, Itay Halevy, Jordon Hemingway, Minming Cui, Maya Gomes, Matthew Granitto, Alf Lenz, Charles Henderson, Chengsheng Jin, Clint Scott, David Champion, Jinghai Yang, Joe Shaffer, Kathy Doyle, Lei Xiang, Liam Bhajan, Patrick Sack, Paul Hoffman, Paulo Linarde Dantas Mascena, Will Thompson-Butler, and Yu Liu for their contributions to SGP. We thank Patrick Sullivan and Laramie Duncan for discussions regarding the PGC and research consortium organization. We thank the donors of The American Chemical Society Petroleum Research Fund for partial support of SGP website development (61017-ND2). EAS is funded by National Science Foundation grant (NSF) EAR-1922966. BGS authors (JE, PW) publish with permission of the Executive Director of the British Geological Survey, UKRI. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government. The authors declare no conflicts of interest.
