Metadata Enrichment Research Articles

Subject indexing, i.e., the enrichment of metadata records for textual resources with descriptors from a controlled vocabulary, is one of the core activities of libraries. Due to the proliferation of digital documents, it is no longer possible to annotate every single document intellectually, which is why we need to explore the potential of automation on every level. At ZBW, efforts to partially or completely automate the subject indexing process started as early as 2000 with experiments involving external partners and commercial software. The conclusion of that first exploratory period was that commercial, supposedly shelf-ready solutions would not suffice to cover the requirements of the library. In 2014 the decision was made to do the necessary applied research in-house, which was implemented by establishing a PhD position. However, the prototypical machine learning solutions developed over the following years had yet to be integrated into productive operations at the library. Therefore, in 2020 an additional position for a software engineer was established and a pilot phase was initiated (planned to last until 2024) with the goal of completing the transfer of our solutions into practice by building a suitable software architecture that allows for real-time subject indexing with our trained models and its integration into the other metadata workflows at ZBW. In this paper we address the question of how to transfer results from applied research into a productive service, and we report on the milestones we have reached so far and on those that are yet to be reached on an operational level.
We also discuss the challenges we faced on a strategic level, the measures and resources (computing power, software, personnel) that were needed to effect the transfer, and those that will be necessary to ensure the continued availability of the architecture and to enable continuous development during running operations. We conclude that there are still no shelf-ready open source systems for the automation of subject indexing – existing software has to be adapted and maintained continuously, which requires various forms of expertise. However, the task of automation is here to stay, and librarians are witnessing the dawn of a new era in which subject indexing is done at least in part by machines, and the respective roles of machines and human experts may shift even further and more rapidly in the not-so-distant future. We argue that, in general, the format of a “project” and the mindset that goes with it may not suffice to secure the commitment that an institution, its decision-makers, and the library community as a whole will have to bring to the table in order to face the monumental task of digital transformation and automation in the long run. We also highlight the importance of all parties – applied researchers, software engineers, stakeholders – staying involved and continuously communicating requirements and issues back and forth in order to create and establish a productive service that is suitable and equipped for operation.

Advances in sequencing technologies have greatly contributed to the documentation of Earth’s biodiversity. However, to explore the full potential of molecular resources for biodiversity, there needs to be a good linkage between sequence data and its biological source, contributing to a network of connected data in the biodiversity research cycle. This requires a foundation of well-structured and accessible annotations in the molecular sequence repositories. The International Nucleotide Sequence Database Collaboration (INSDC), of which the European Nucleotide Archive (ENA) is the European node, holds a large number of annotations associated with sequence data and relating to its biological source (e.g., specimens in natural history collections). However, for a number of records, these annotations may be incomplete (e.g., missing voucher information), ambiguous, or even inaccurate. Therefore, we have implemented a workflow that allows third-party annotations to be attached to sequence and sample records using two existing services, the PlutoF platform and the ELIXIR Contextual Data ClearingHouse. This work was developed within the scope of the BiCIKL (Biodiversity Community Integrated Knowledge Library) project, which aims to establish open science practices in the biodiversity domain. PlutoF is an online data management platform that also provides computing services for biology-related research. PlutoF allows registered users to enter their own data and to access public data at INSDC. Users can enter and manage a range of data, such as taxonomic classifications, occurrences, etc. The platform also includes a module that allows the addition of third-party annotations (on material source, taxonomic identification, etc.) linked to specimens or sequence records. This module was already in use by the UNITE community for annotation of INSDC rDNA Internal Transcribed Spacer sequence datasets (Abarenkov et al. 2021).
These UNITE annotations are displayed in the National Centre for Biotechnology Information (NCBI) records through links to the PlutoF platform. However, there was a need for an automated solution that allows third-party annotations to be attached to any sequence or sample record at INSDC. This was implemented through the ELIXIR Contextual Data ClearingHouse (hereafter Clearinghouse). The Clearinghouse provides a simple RESTful Application Programming Interface (API) to support the submission of additions and improvements to current metadata attributes, such as information on material sources, for records publicly available in the ELIXIR data resources. The Clearinghouse enables the submission of these corrected metadata from databases (such as the PlutoF platform) to the primary data repositories. The workflow developed is shown in Fig. 1 and consists of the following steps: i) users annotate sequence metadata that is regularly downloaded from INSDC using NCBI’s E-utilities; ii) an annotation proposal is created and a verification notification is sent to an assigned reviewer; iii) the reviewer evaluates the annotation proposal and accepts it or rejects it with comments; iv) if the annotation proposal is accepted, the annotated fields that can be mapped to ENA fields are pushed to the Clearinghouse using its RESTful API. Annotations received at ENA are then reviewed before being displayed. This workflow is implemented through a web interface in PlutoF, which allows user-friendly and effortless reporting of corrections or additions to biological source metadata in sequence records. Overall, we expect this tool to contribute to the enrichment of metadata associated with sequence records, and thereby to increase the links between molecular and biodiversity resources and enable sequencing data to deliver their full potential for biodiversity conservation.
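
The review and submission steps (ii to iv) of the workflow described above can be sketched roughly as follows. This is a minimal illustration only, not the PlutoF or Clearinghouse implementation: the class and function names, the field mapping, the accession, and the payload layout are all hypothetical, and no real API call is made.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


# Hypothetical mapping from annotation field names to ENA field names.
# Fields without an entry here are never pushed onward.
FIELD_MAP = {
    "voucher": "specimen_voucher",
    "country": "country",
}


@dataclass
class AnnotationProposal:
    """Step ii: a proposal created for a record and assigned to a reviewer."""
    record_id: str   # placeholder accession of the annotated record
    fields: dict     # proposed field name -> proposed value
    reviewer: str
    status: Status = Status.PENDING
    comment: Optional[str] = None


def review(proposal: AnnotationProposal, accept: bool,
           comment: Optional[str] = None) -> AnnotationProposal:
    """Step iii: the reviewer accepts, or rejects with a comment."""
    if not accept and not comment:
        raise ValueError("a rejection needs a comment")
    proposal.status = Status.ACCEPTED if accept else Status.REJECTED
    proposal.comment = comment
    return proposal


def build_submission(proposal: AnnotationProposal) -> Optional[dict]:
    """Step iv: keep only fields that map to ENA fields.

    Returns None if the proposal was not accepted or nothing can be mapped.
    The returned dictionary layout is an assumption for illustration.
    """
    if proposal.status is not Status.ACCEPTED:
        return None
    mapped = {FIELD_MAP[k]: v for k, v in proposal.fields.items() if k in FIELD_MAP}
    if not mapped:
        return None
    return {"record_id": proposal.record_id, "attributes": mapped}


# Usage with placeholder values.
accepted = review(
    AnnotationProposal("ACC0001", {"voucher": "EXAMPLE-VOUCHER-1", "note": "free text"},
                       reviewer="reviewer_a"),
    accept=True,
)
payload = build_submission(accepted)

rejected = review(
    AnnotationProposal("ACC0002", {"country": "Estonia"}, reviewer="reviewer_a"),
    accept=False,
    comment="locality could not be verified",
)
no_payload = build_submission(rejected)
```

In this sketch the unmapped `note` field is dropped before submission, and a rejected proposal produces no payload at all, which mirrors the gating that the reviewer step provides in the described workflow.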

Articles published on Metadata Enrichment

Identifying genomic data use with the Data Citation Explorer

OpenCitations Meta

Cultural heritage on the Semantic Web: The Europeana Data Model

OAVA: the open audio-visual archives aggregator

RADio* – An Introduction to Measuring Normative Diversity in News Recommendations

Automating subject indexing at ZBW

Improving FAIRness of eDNA and Metabarcoding Data: Standards and tools for European Nucleotide Archive data deposition

Wikibase Model for Premodern Manuscript Metadata Harmonization, Linked Data Integration, and Discovery

THE CORPORA-ORIENTED PROJECTS AND COURSES – INNOVATION OF THE UNIVERSITY LIFE

ENHANCED FINDABILITY AND REUSABILITY OF ENGINEERING DATA BY CONTEXTUAL METADATA

DETEXA: declarative extensible text exploration and analysis through SQL

Cross-portal metadata alignment – Connecting open data portals through means of formal concept analysis

Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters.

Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Entification of Metadata for Knowledge‐Graph‐Guided Discovery

Paving the way for enriched metadata of linguistic linked data

Enabling Community Curation of Biological Source Annotations of Molecular Data Through PlutoF and the ELIXIR Contextual Data Clearinghouse

Towards Connecting Molecular Data and the Biodiversity Research Community: An ENA and ELIXIR biodiversity community perspective

New Approaches Towards the Delivery of Service Information Using Semantic Correlation Rules

Re-imagining (black) comic book cataloguing: increasing accessibility through metadata at one university library
