Automated assessment of biological database assertions using the scientific literature

Mohamed Reda Bouadjenek,Justin Zobel,Karin Verspoor

doi:10.1186/s12859-019-2801-x

Open Access

https://doi.org/10.1186/s12859-019-2801-x

Journal: BMC Bioinformatics
Publication Date: Apr 29, 2019
Citations: 3
License type: open-access

Affiliation: University of Toronto, University of Melbourne

Abstract

Background: The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct.

Results: Our experiments on assessing gene–disease relations and protein–protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement in F-measure of 3.5% and 13%, respectively, on gene–disease relations and protein–protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents.

Conclusions: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.

Highlights

The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature.
To demonstrate the scale of the challenge we address in this paper, consider Fig. 2, which shows the distribution of literature co-mentions of correct or incorrect gene–disease relations and correct or incorrect protein–protein interactions, where correctness is determined based on human-curated relational data.
Given a new biological assertion represented as a relation, the Biocuration tool for Assessment of Relation Consistency (BARC) first retrieves a list of documents that may be used to check the correctness of that relation.
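
The retrieval step described in the last highlight can be illustrated with a small Python sketch. It is not the authors' SaBRA algorithm or the BARC classifier; it only shows, under simplifying assumptions, what it means to represent an assertion as a relation between two objects and to collect the documents that co-mention both. The names `Relation`, `retrieve_comentions`, `comention_features`, and the `TOY_DOCS` sentences are hypothetical and invented for illustration; matching is plain case-insensitive substring search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A biological assertion as a pair of objects (hypothetical type)."""
    subject: str  # e.g. a gene symbol
    obj: str      # e.g. a disease name


# Toy stand-in for a literature collection; these sentences are made up.
TOY_DOCS = [
    "Mutations in BRCA1 are associated with breast cancer risk.",
    "BRCA1 interacts with BARD1 in DNA repair.",
    "Breast cancer screening guidelines were updated.",
    "Loss of BRCA1 function is observed in hereditary breast cancer.",
]


def retrieve_comentions(relation, documents):
    """Return the documents that mention both objects of the relation.

    Assumption: naive substring matching stands in for real entity
    recognition and for the set-based relevance ranking in the paper.
    """
    a = relation.subject.lower()
    b = relation.obj.lower()
    return [d for d in documents if a in d.lower() and b in d.lower()]


def comention_features(relation, documents):
    """Compute two simple co-mention features a classifier could consume."""
    hits = retrieve_comentions(relation, documents)
    ratio = len(hits) / len(documents) if documents else 0.0
    return {"n_comentions": len(hits), "comention_ratio": ratio}


if __name__ == "__main__":
    relation = Relation("BRCA1", "breast cancer")
    print(comention_features(relation, TOY_DOCS))
```

In a fuller pipeline, features of this kind would be passed to a trained classifier that estimates the likelihood that the relation is correct; the feature set here is deliberately minimal and is an assumption, not the one analysed in the paper.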

Summary

Introduction

The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. The large biological databases are a foundational, critical resource in both biomedical research and, increasingly, clinical health practice. These databases, typified by GenBank and UniProt, represent our collective knowledge of DNA and RNA sequences, genes, proteins, and other kinds of biological entities. PubMed [3], as the primary index of biomedical research publications, is typically consulted for this purpose.

Objectives

Methods

Results

Conclusion