Abstract

IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

Victor M. Markowitz 1,*, Konstantinos Mavromatis 2, Natalia N. Ivanova 2, I-Min A. Chen 1, Ken Chu 1, and Nikos C. Kyrpides 2

1 Biological Data Management and Technology Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
2 Genome Biology Program, DOE Joint Genome Institute, 2800 Mitchell Dr., Walnut Creek, CA 94598, USA

ABSTRACT

A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included in various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG’s rich integrated genome framework. New genome datasets are included in IMG ER prior to their public release, either with their native annotations or with annotations generated by IMG ER’s annotation pipeline. IMG ER tools help address annotation problems detected with IMG’s comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used to improve the annotations of about 150 microbial genomes.

INTRODUCTION

A rapidly increasing number of microbial genomes are sequenced by organizations worldwide, undergo similar annotation procedures, and are eventually included in public genome data resources.
First, raw (“read”) sequences of microbial genomes are assembled into longer “contigs” (contiguous sequences) in order to produce “draft” genome sequences, with draft genomes sometimes “finished” by closing gaps between contigs. Next, annotation pipelines are used for predicting genes and determining their functional roles in draft or finished genomes. Subsequently, annotated microbial genome sequences are submitted to or collected by primary archival public sequence data repositories, such as GenBank (Benson et al. 2009), which perform data validation on genome datasets in order to ensure consistency of their format and, to a certain degree, their content. Datasets in these resources have different degrees of precision and resolution due to the diverse annotation methods employed by individual data providers. Secondary public resources, such as NCBI’s RefSeq (Pruitt et al. 2007), further process microbial genome data from primary resources with the dual goals of providing the most current view of microbial genome sequences and of gradually increasing the quality and completeness of their associated functional annotations via manual curation and computation. In addition to public primary and secondary resources, microbial genome datasets are incorporated into a variety of tertiary resources, such as SEED (Overbeek et al. 2005) and IMG

* To whom correspondence should be addressed.
