Understanding life sciences data curation practices via user research

Aravind Venkatesan, Nikiforos Karamanis, Michele Ide-Smith, Jonathan Hickford, Johanna McEntyre

doi:10.12688/f1000research.19427.1

Open Access

https://doi.org/10.12688/f1000research.19427.1

Journal: F1000Research	Publication Date: Sep 11, 2019
Citations: 5	License type: CC BY 4.0

Affiliation: European Bioinformatics Institute

Abstract

Background: Manual curation is a cornerstone of public biological data resources. However, it is a time-consuming process that urgently needs supportive technical solutions in the face of rapid data growth. Supporting scalable curation is part of the mission of the Elixir Data Platform. Thus far, we have established infrastructure capable of ingesting and aggregating text-mined outputs from multiple providers and making these available via an API. This public API is used by Europe PMC to display specific entities and relationships on full-text articles (via the SciLite application).

Methods: To ensure that the future development of this infrastructure meets the needs of curators, we carried out a user research project to understand and identify common workflow patterns and practices via an observational study. Building on these outcomes, we then devised a curator community survey to understand more specifically which entity types, sections of a paper and tools are of top priority to address.

Results: The main challenges faced by curators were the following: a) the volume of literature is large, so curators need ways to prioritise and identify relevant papers for curation; b) finding specific information can prove difficult, so quick ways of filtering articles based on specific entities, such as experimental methods, species, genes, cell lines and tissue samples, are required; and c) transferring information from the search/annotation tools to the various curation workflows is also challenging.

Conclusions: This study lays the foundation for identifying actionable items to orient the current infrastructure towards meeting the needs of the curation community, by improving text-mined annotation quality and coverage, through other engineering solutions, and by reusing text-mined annotations and other metadata in Europe PMC for article triage. Furthermore, this study presents an opportunity to explore customisation of triage/ranking systems to suit different curation contexts.
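As an illustration of the entity-based filtering and article triage described in the abstract, the following is a minimal sketch only. It assumes that text-mined annotations are already available as a mapping from entity type to annotated terms per article; the Article record, the triage function, the entity type names and the sample data are all hypothetical and do not reflect the Europe PMC API or the authors' infrastructure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Article:
    """Hypothetical record: an article ID plus its text-mined annotations."""
    article_id: str
    # Maps an entity type (e.g. "species") to the terms annotated in the text.
    annotations: Dict[str, Set[str]] = field(default_factory=dict)


def triage(articles: List[Article], wanted: Dict[str, Set[str]]) -> List[str]:
    """Rank articles by how many of the wanted entity terms they mention.

    `wanted` maps an entity type to the terms a curator cares about.
    Articles that match none of the terms are dropped.
    """
    scored = []
    for article in articles:
        score = sum(
            len(article.annotations.get(entity_type, set()) & terms)
            for entity_type, terms in wanted.items()
        )
        if score > 0:
            scored.append((score, article.article_id))
    # Highest score first; ties are broken by article ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored]


# Made-up sample data, for illustration only.
articles = [
    Article("A1", {"species": {"Homo sapiens"}, "gene": {"TP53", "BRCA1"}}),
    Article("A2", {"species": {"Mus musculus"}, "gene": {"TP53"}}),
    Article("A3", {"experimental_method": {"western blot"}}),
]
wanted = {"species": {"Homo sapiens"}, "gene": {"TP53"}}
ranked = triage(articles, wanted)
```

Counting matched terms is the simplest possible scoring rule; a curation team could swap in weights per entity type or per article section to suit its own context, which is the kind of customisation the conclusions point towards.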

Highlights

Biological databases play a key role in knowledge discovery in life science research
In this report we describe the outcomes of a user research project, conducted to understand curation practices and priorities for article selection
Contributions made by manual curation are vital to the maintenance of biological databases

Summary

Introduction

Biological databases play a key role in knowledge discovery in life science research. Machine learning and analytics promise to provide better ranking of reading lists, classification of articles, and identification of assertions with their biological context and evidence buried within the text of articles. To this end, many life science knowledgebases include text mining (to varying degrees) in curation workflows.
