The availability of accurate and up-to-date information on marine species, ranging from nomenclature and taxonomy to species occurrences, is fundamental for several stakeholders, from marine scientists to policymakers. Today, this information is provided by well-known authoritative systems, including the World Register of Marine Species (WoRMS*1), a comprehensive register of names of marine organisms and related information; Marine Regions*2, a register of georeferenced information on marine place names and areas; and the European Ocean Biogeographic Information System (EurOBIS*3), where species occurrences in European waters are recorded. These platforms make marine biodiversity data accessible and interoperable, supporting applications and use cases that foster advancements in taxonomy, marine biology, ecology and environmental science. The MAREGRAPH project*4 aims to further increase the findability, accessibility, interoperability and reusability of these foundational high-value datasets by creating and publishing an open knowledge graph (KG) on marine biodiversity data through the semantic uplifting of the datasets involved, relying on the Semantic Web stack and its standards (Fig. 1). This requires: (i) the definition and publication of a network of formal, reusable and extensible ontologies and controlled vocabularies for representing and linking together marine biodiversity information, as a basis for semantic interoperability; and (ii) the setup of a cost-effective, flexible and scalable data architecture to publish the data as Linked Open Data and provide access through standard APIs, from a SPARQL endpoint to Linked Data Event Streams (LDES), enabling technical interoperability and serving various use cases and data access needs.
The methodology adopted for defining and publishing the KG combines the consolidated principles of the Open Standards for Linked Organisations (OSLO) framework, which defines the governance structure and an open process for developing semantic data standards, with the well-established agile and collaborative ontology development workflow of the eXtreme Design methodology (Presutti 2009), based on ontology design patterns (ODPs). Building also on the experience gained in defining the Marine Regions Ontology and publishing the Marine Regions gazetteer as LDES (Lonneville 2021), we are iteratively defining the ontologies for representing: (i) marine species, whose description is rooted in the definition of taxa and their scientific names, and extends to geographic distribution, scientific literature, ecological traits and other biological information as represented in WoRMS; and (ii) species observations, describing the occurrence of species in space and time, with associated descriptive data as well as biological and environmental measurements, as recorded in EurOBIS.
Our approach to ontology design is driven by the identification of use cases and competency questions, as well as by existing models and data, and involves the community through workshops, webinars and co-creation sessions. Existing domain ontologies and reference data models, such as the Biodiversity Information Standards (TDWG) Darwin Core standard, the Taxonomic Concept Schema, the Bioschemas profiles for taxa and taxon names, the OpenBioDiv ontology, and the Catalogue of Life Data Package (CoLDP), were considered for the identification of ODPs and for reuse, particularly through ontology semantic alignments. As fostering consensus on data semantics is a critical step toward broad acceptance and adoption, semantic assets are incrementally published on a dedicated GitHub repository*5, again involving the community through public review processes. MAREGRAPH will then enable the production of linked open datasets in which the data from WoRMS, Marine Regions and EurOBIS are seamlessly integrated and interlinked in a unified KG that can be further enriched and linked with data from other initiatives focused on marine biodiversity.
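To make the idea of a unified, interlinked KG more concrete, the following minimal Python sketch shows how a taxon, a marine place and a species occurrence could be connected and queried together. It is an illustration under stated assumptions, not the MAREGRAPH data model: the example.org IRIs, the properties ofTaxon, inPlace and placeName, and the sample records are all hypothetical. Only the Darwin Core term IRIs (Taxon, Occurrence, scientificName, eventDate) and rdf:type are existing vocabulary terms.

```python
# Illustrative sketch only. The EX namespace, its properties and all records
# are hypothetical; they are NOT the MAREGRAPH ontologies or real data.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DWC = "http://rs.tdwg.org/dwc/terms/"          # Darwin Core terms (real IRIs)
EX = "https://example.org/maregraph-sketch/"   # hypothetical namespace

# A tiny in-memory graph of (subject, predicate, object) triples.
TRIPLES = [
    # Taxon with its scientific name, as a species register might describe it
    (EX + "taxon/1", RDF_TYPE, DWC + "Taxon"),
    (EX + "taxon/1", DWC + "scientificName", "Gadus morhua"),
    (EX + "taxon/2", RDF_TYPE, DWC + "Taxon"),
    (EX + "taxon/2", DWC + "scientificName", "Solea solea"),
    # Named marine place, as a gazetteer might describe it
    (EX + "place/1", EX + "placeName", "North Sea"),
    # Occurrences linking a taxon to a place and a date
    (EX + "occ/1", RDF_TYPE, DWC + "Occurrence"),
    (EX + "occ/1", EX + "ofTaxon", EX + "taxon/1"),
    (EX + "occ/1", EX + "inPlace", EX + "place/1"),
    (EX + "occ/1", DWC + "eventDate", "2020-05-17"),
    (EX + "occ/2", RDF_TYPE, DWC + "Occurrence"),
    (EX + "occ/2", EX + "ofTaxon", EX + "taxon/2"),
    (EX + "occ/2", EX + "inPlace", EX + "place/1"),
    (EX + "occ/2", DWC + "eventDate", "2021-09-02"),
]


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if all(q is None or q == v for q, v in zip((s, p, o), t))
    ]


def value(s, p):
    """Return the first object found for (s, p), or None if there is none."""
    hits = match(s=s, p=p)
    return hits[0][2] if hits else None


def occurrences_of(scientific_name):
    """Join name -> taxon -> occurrence -> place, much as a SPARQL
    basic graph pattern would, and return (place name, date) pairs."""
    results = []
    for taxon, _, _ in match(p=DWC + "scientificName", o=scientific_name):
        for occ, _, _ in match(p=EX + "ofTaxon", o=taxon):
            place = value(occ, EX + "inPlace")
            if place is None:
                continue  # skip occurrences without a linked place
            results.append(
                (value(place, EX + "placeName"), value(occ, DWC + "eventDate"))
            )
    return results


print(occurrences_of("Gadus morhua"))  # [('North Sea', '2020-05-17')]
```

In a published KG the same question would more likely be asked as a SPARQL query against the endpoint; the nested pattern matching above only mimics that join on a few hand-written triples.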