Abstract

Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers: as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then validating that the data found in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58% and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by Ph.D.-level scientists is highly accurate.

Database URL: http://ecocyc.org/, http://www.candidagenome.org/
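
For clarity, the overall error rate reported above is simply the fraction of validated assertions that were not supported by their cited publications; restating the arithmetic using only the counts given in the abstract:

\[
\text{error rate} = \frac{\text{unsupported assertions}}{\text{assertions validated}} = \frac{10}{633} \approx 0.0158 = 1.58\%
\]

The per-database rates (1.82% for CGD, 1.40% for EcoCyc) follow the same calculation over each database's own set of validated facts, whose individual counts are not broken out in this section.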

Highlights

  • Model Organism Databases (MODs) have become tightly woven into the fabric of modern life science research

  • MOD accuracy is important because MODs are used daily by thousands of scientists as online encyclopedias and to help interpret new experimental data in the context of existing knowledge

  • While checking the errors reported by the validators, we found several cases that we considered validation errors

Introduction

Model Organism Databases (MODs) have become tightly woven into the fabric of modern life science research. Ph.D.-level biologists read scientific publications, extract key facts from these publications and enter those facts into both structured and unstructured fields in MODs. Manual curation has been widely assumed to be highly accurate, on the premise that Ph.D.-level biocurators can understand and accurately interpret the life science literature and correctly transcribe the facts that they read. The data captured in MODs are used to develop gold standards for training and evaluating predictive algorithms in bioinformatics. When developing algorithms for predicting promoters, operons or protein–protein interactions, bioinformaticists use MODs as sources of reference data sets to evaluate and optimize the accuracy of their algorithms [1,2,3]. MODs have received millions of dollars in government funding over the past 20 years and are widely supported by their respective communities, yet limited data exist regarding their accuracy.
