Abstract

Updating genome databases to reflect newly published molecular findings for an organism was hard enough when only a single strain of a given organism had been sequenced. With multiple sequenced strains now available for many organisms, the challenge has grown significantly because of the still-limited resources available for the manual curation that corrects errors and captures new knowledge. We have developed a method to automatically propagate multiple types of curated knowledge from genes and proteins in one genome database to their orthologs in uncurated databases for related strains, imposing several quality-control filters to reduce the chances of introducing errors. We have applied this method to propagate information from the highly curated EcoCyc database for Escherichia coli K-12 to databases for 480 other Escherichia coli strains in the BioCyc database collection. The increase in value and utility of the target databases after propagation is considerable. Target databases received updates for an average of 2,535 proteins each. In addition to widespread addition and regularization of gene and protein names, 97% of the target databases were improved by the addition of at least 200 new protein complexes, at least 800 new or updated reaction assignments, and at least 2,400 sets of GO annotations.

Highlights

  • Manual curation of biological databases is a time-consuming and moderately expensive (Karp, 2016b) task, requiring biological expertise, attention to detail, and the ability to sift through and evaluate the experimental literature

  • Most Pathway/Genome Databases (PGDBs) received updated reaction assignments for 900–1,100 proteins. This set includes both cases in which new or different reaction assignments were added to a gene product and cases in which spurious reaction assignments were removed

  • Because the protein belongs to a larger family of haloacid dehalogenase hydrolases, in many PGDBs it was previously assigned to a set of dehalogenase reactions

Introduction

Manual curation of biological databases is a time-consuming and moderately expensive (Karp, 2016b) task, requiring biological expertise, attention to detail, and the ability to sift through and evaluate the experimental literature. The outcome of all this applied effort and expertise is that expert manual curation remains the gold standard of database quality (Keseler et al, 2014). Cheaper automated text-mining systems, while suitable for certain limited applications, are not yet capable of making the determinations required to populate rich, complex, multi-datatype databases such as those in the BioCyc collection (Karp, 2016a). We describe an automated method that propagates curated information from one Pathway/Genome Database (PGDB) to other PGDBs within BioCyc. We have applied the method to propagate curation from the EcoCyc database to databases for 480 other E. coli strains in BioCyc, thereby leveraging limited curation resources to greatly increase the value of BioCyc.
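To make the idea of filtered propagation concrete, the following is a minimal sketch in Python and not the implementation used for BioCyc. The `Protein` record, the `propagate` function, the ortholog mapping, and the `min_identity` threshold are all hypothetical. This section does not describe the actual quality-control filters, so the three checks shown are illustrative stand-ins.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical, simplified record for a gene product in a PGDB."""
    gene_id: str
    name: str = ""
    go_terms: set[str] = field(default_factory=set)
    curated: bool = False


def propagate(source, target, orthologs, min_identity=0.9):
    """Copy the curated name and GO terms from source proteins to orthologs.

    source, target: dicts mapping gene_id -> Protein.
    orthologs: dict mapping a source gene_id to (target gene_id, identity).
    Returns the target gene_ids that were updated, in iteration order.
    """
    updated = []
    for src_id, (tgt_id, identity) in orthologs.items():
        src = source.get(src_id)
        tgt = target.get(tgt_id)
        # Assumed filter 1: both records must exist.
        if src is None or tgt is None:
            continue
        # Assumed filter 2: propagate only from curated source entries.
        if not src.curated:
            continue
        # Assumed filter 3: require a close sequence match (threshold is illustrative).
        if identity < min_identity:
            continue
        tgt.name = src.name
        tgt.go_terms |= src.go_terms
        updated.append(tgt_id)
    return updated
```

In this sketch a target protein is left untouched whenever any filter fails, which reflects the stated goal of reducing the chance of introducing errors at the cost of propagating less information.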
