Abstract

DiSSCo, the Distributed System of Scientific Collections, is a pan-European Research Infrastructure (RI) that mobilises and unifies bio- and geo-diversity information connected to the specimens held in natural science collections and delivers it to scientific communities and beyond. Bringing together 120 institutions across 21 countries and combining earlier investments in data interoperability practices with technological advancements in digitisation, cloud services and semantic linking, DiSSCo makes the data from natural science collections available as one virtual data cloud, connected with data emerging from new techniques and not already linked to specimens. These new data include DNA barcodes, whole genome sequences, proteomics and metabolomics data, chemical data, trait data, and imaging data (Computer-assisted Tomography (CT), Synchrotron, etc.), to name but a few, and will lead to a wide range of end-user services that begins with finding, accessing, using and improving data. DiSSCo will deliver the diagnostic information required for novel approaches and new services that will transform the landscape of what is possible in ways that are hard to imagine today. With approximately 1.5 billion objects to be digitised, bringing natural science collections to the information age is expected to result in many tens of petabytes of new data over the next decades, used on average by 5,000–15,000 unique users every day. This requires new skills, clear policies, robust procedures and new technologies to create, work with and manage large digital datasets over their entire research data lifecycle, including their long-term storage and preservation and open access. Such processes and procedures must match and be derived from the latest thinking in open science and data management, realising the core principles of 'findable, accessible, interoperable and reusable' (FAIR).
Synthesised from results of the ICEDIG project ("Innovation and Consolidation for Large Scale Digitisation of Natural Heritage", EU Horizon 2020 grant agreement No. 777483), the DiSSCo Conceptual Design Blueprint covers the organisational arrangements; processes and practices; architecture, tools and technologies; culture, skills and capacity building; and governance and business model proposals for constructing the digitisation infrastructure of DiSSCo. In this context, the digitisation infrastructure of DiSSCo must be interpreted as that infrastructure (machinery, processing, procedures, personnel, organisation) offering Europe-wide capabilities for mass digitisation and digitisation-on-demand, and for the subsequent management (i.e., curation, publication, processing) and use of the resulting data. The blueprint constitutes the essential background needed to continue work to raise the overall maturity of the DiSSCo Programme across multiple dimensions (organisational, technical, scientific, data, financial) to achieve readiness to begin construction. Today, collection digitisation efforts have reached most collection-holding institutions across Europe. Much of the leadership and many of the people involved in digitisation and working with digital collections wish to take steps forward and expand the efforts to benefit further from the already noticeable positive effects. The collective results of examining technical, financial, policy and governance aspects show the way forward to operating a large distributed initiative, i.e., the Distributed System of Scientific Collections (DiSSCo), for natural science collections across Europe. Many examples of, and opportunities and needs for, innovation and consolidation for large scale digitisation of natural heritage have been described.
The blueprint makes one hundred and four (104) recommendations to be considered by other elements of the DiSSCo Programme of linked projects (i.e., SYNTHESYS+, COST MOBILISE, DiSSCo Prepare, and others to follow) and the DiSSCo Programme leadership as the journey towards organisational, technical, scientific, data and financial readiness continues. Nevertheless, significant obstacles must be overcome as a matter of priority if DiSSCo is to move beyond its Design and Preparatory Phases during 2024. Specifically, these include:
Organisational: Strengthen common purpose by adopting a common framework for policy harmonisation and capacity enhancement across broad areas, especially in respect of digitisation strategy and prioritisation, digitisation processes and techniques, data and digital media publication and open access, protection of and access to sensitive data, and administration of access and benefit sharing. Pursue the joint ventures and other relationships necessary to the successful delivery of the DiSSCo mission, especially ventures with GBIF and other international and regional digitisation and data aggregation organisations, in the context of infrastructure policy frameworks, such as EOSC. Proceed with the explicit aim of avoiding divergences of approach in global natural science collections data management and research.
Technical: Adopt and enhance the DiSSCo Digital Specimen Architecture and, specifically as a matter of urgency, establish the persistent identifier scheme to be used by DiSSCo and (ideally) other comparable regional initiatives. Establish the (software) engineering development and (infrastructure) operations team and direction essential to the delivery of services and functionalities expected from DiSSCo, such that earnest engineering can lead to an early start of DiSSCo operations.
Scientific: Establish a common digital research agenda leveraging Digital (extended) Specimens as anchoring points for all specimen-associated and -derived information, demonstrating to research institutions and policy/decision-makers the new possibilities, opportunities and value of participating in the DiSSCo research infrastructure.
Data: Adopt the FAIR Digital Object Framework and the International Image Interoperability Framework as the low entropy means to achieving uniform access to rich data (image and non-image) that is findable, accessible, interoperable and reusable (FAIR). Develop and promote best practice approaches towards achieving the best digitisation results in terms of quality (best, according to agreed minimum information and other specifications), time (highest throughput, fast), and cost (lowest, minimal per specimen).
Financial: Broaden attractiveness (i.e., improve bankability) of DiSSCo as an infrastructure to invest in. Plan for finding ways to bridge the funding gap to avoid disruptions in the critical funding path that risk interrupting core operations, especially when the gap opens between the end of preparations and beginning of implementation due to unsolved political difficulties.
Strategically, it is vital to balance the multiple factors addressed by the blueprint against one another to achieve the desired goals of the DiSSCo programme. Decisions cannot be taken on one aspect alone without considering other aspects, and here the various governance structures of DiSSCo (General Assembly, advisory boards, and stakeholder forums) play a critical role over the coming years.

Highlights

  • Significant obstacles must be overcome as a matter of priority if the Distributed System of Scientific Collections (DiSSCo) is to move beyond its Design and Preparatory Phases during 2024

  • Decisions cannot be taken on one aspect alone without considering other aspects, and here the various governance structures of DiSSCo (General Assembly, advisory boards, and stakeholder forums) play a critical role over the coming years

  • Numerous (104) recommendations have been made to be considered by other elements of the DiSSCo Programme of linked projects (i.e., SYNTHESYS+, COST MOBILISE, DiSSCo Prepare, and others to follow) and by the DiSSCo Programme leadership as the journey towards organisational, technical, scientific, data and financial readiness continues

Summary

Background

It is vital to balance multiple factors – technical and engineering, organisational and political, financial and legal, and operational and governance – against one another to achieve the desired goals of the DiSSCo programme. With approximately 1.5 billion objects to be digitized, bringing natural science collections to the information age is expected to result in 90 petabytes of new data over the coming decades, used on average by 5,000–15,000 unique users every day. This requires new skills, clear policies and robust procedures to create, work with and manage large digital datasets over their entire research data lifecycle, including their long-term storage and preservation and open access. By mass digitization we mean digitizing entire collections or their major distinct parts at industrial scale, i.e., millions of objects annually at low cost (e.g., < c.€0.50 per item), characterised by improved workflows and by technological and procedural frameworks based on automation (both hardware and software) and enrichment (link-building). This is critical within DiSSCo to mobilise the data from collections as rapidly as possible, so that these data can be more easily found and used, and can act as an anchor or ‘keyring’ for other data.
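
The scale implied by the figures in the Background paragraph can be checked with some back-of-envelope arithmetic. The sketch below is illustrative only and is not taken from the blueprint. It assumes decimal petabytes (10^15 bytes), treats the quoted unit cost of under c.€0.50 per item as an upper bound, and uses a hypothetical throughput of 10 million objects per year to stand in for "millions of objects annually".

```python
# Back-of-envelope estimates from the figures quoted in the Background text.
# All derived values are rough illustrations, not numbers from the blueprint.

OBJECTS = 1_500_000_000        # approx. objects to be digitized
TOTAL_BYTES = 90 * 10**15      # 90 petabytes (decimal petabytes assumed)
MAX_COST_PER_ITEM_EUR = 0.50   # "< c. EUR 0.50 per item", used as an upper bound


def bytes_per_object(total_bytes: int, objects: int) -> float:
    """Average data volume per digitized object, in bytes."""
    return total_bytes / objects


def cost_ceiling_eur(objects: int, cost_per_item: float) -> float:
    """Total cost if every object were digitized at the given unit cost."""
    return objects * cost_per_item


def years_to_digitize(objects: int, objects_per_year: int) -> float:
    """Years needed to digitize all objects at a constant annual throughput."""
    return objects / objects_per_year


if __name__ == "__main__":
    mb = bytes_per_object(TOTAL_BYTES, OBJECTS) / 10**6
    print(f"Average volume per object: {mb:.0f} MB")

    ceiling = cost_ceiling_eur(OBJECTS, MAX_COST_PER_ITEM_EUR)
    print(f"Cost ceiling at the quoted unit cost: EUR {ceiling / 10**6:.0f} million")

    # Hypothetical throughput, chosen only to illustrate the calculation.
    years = years_to_digitize(OBJECTS, 10_000_000)
    print(f"Years at 10 million objects per year: {years:.0f}")
```

Under these assumptions the average works out to about 60 MB per object and the cost ceiling to EUR 750 million. The duration estimate depends entirely on the assumed throughput and would shrink in proportion as annual capacity grows.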

Structure of the document
Key terms
References to other documents
Rationale for a Distributed System of Scientific Collections
Accelerating beyond the current situation
Disconnected infrastructure
Industrialising digitization
Understanding digitization
International landscape and DiSSCo positioning
Innovations and consolidations identified by ICEDIG
Overall approach and direction
The FAIR Guiding Principles
The provisional Data Management Plan for DiSSCo infrastructure
Role and development of a common digital research agenda
Common policy elements
Participation of citizen science
Types of partnering
Customer-supplier relationships
Stakeholder investment
Organising alliances
Identified strategic opportunities for DiSSCo
DiSSCo Centres of Excellence
Role of the private sector and options for public procurement
Generic data storage and computation services
EOSC and the FAIR Digital Object Framework
Implications for establishment of research infrastructures
Early stage arrangements
More formal arrangements
Importance of correct legal status
Implications for data management practices
Open research data
Characteristics
Factors influencing digitization choices
Affordability and achievability
Organising mass digitization
Inhouse vs outsourced
Specialisation
Small and private collections
Decentralisation
DiSSCo’s role
Warehousing
Centres of Excellence for harmonising approaches in DiSSCo
Characteristics
A framework of prioritisation criteria
Offering digitization-on-demand for selected specimens
Pulling out selected specimens
Collection Digitization Dashboard
Incentives for digitization of private collections
Software sustainability and maintenance
Investing for software development
Timing of development
Common design standards
Resilience in a community endeavour
DiSSCo Technical Team
Technical concept for data management
Principal components of DSArch
The FAIR Guiding Principles and FAIR Digital Objects
Evolutionary architecture with protected characteristics
Action steps and phasing
Hub infrastructure
Data coupling
Beyond the initial phases
Bringing technical innovations to required readiness level
NSId PID scheme
Open access guidelines
Minimum information standards
Use of public data repositories
Service portfolio management
Digitization design alternatives
Available options
Improving manual transcription
Field notebooks
Georeferencing
Automated text digitization
The definition of quality
Digitization quality management as preventative work
Data quality improvement as curational work
Image quality management
Use of automation and robotics
Long-term data preservation alternatives
Reproducible research through research objects
Types of collections being digitized
Current digitization efforts
Tendency towards on-site digitization
Specialist digitization teams
Internal documentation and tracking
Cultural differences
The limitations of current capacities to perform digitization
Limitations in resources and funding
Digitization becoming business as usual
Effect on collaboration and research
Effect on mobility of collections
Re-organising work
Interoperability of Collection Management Systems with DiSSCo Hub
Support for automated data capture in Darwin Core data standard
Dealing with blank fields in Darwin Core data standard
Data about people
Geography
Data migration
Costing as a new practice
Costs versus charges
Training and working better together
Awareness raising and promotion in the Preparatory Phase
Governance of the DiSSCo Programme
Requirements for a new model
General Assembly
Membership
Decision-making
Coordination and Support Office
Advisory Bodies
Coordination Bodies
The critical funding path for DiSSCo
Criteria influencing national funding commitment towards DiSSCo
Direct funding model option
Diversification of funding streams
National funding frameworks
Consolidating national funding – The hourglass model
RI cluster funding
Governmental securities – shared liability
Conclusions
Glossary of terms and abbreviations
Generic guidelines
Requirements for FDOF
Findings
FDOF glossary