Abstract

Cyanorak v2.1 (http://www.sb-roscoff.fr/cyanorak) is an information system dedicated to visualizing, comparing and curating the genomes of Prochlorococcus, Synechococcus and Cyanobium, the most abundant photosynthetic microorganisms on Earth. The database encompasses sequences from 97 genomes covering most of the wide genetic diversity known so far within these groups; these sequences were split into 25,834 clusters of likely orthologous groups (CLOGs). The user interface gives access to genomic characteristics and accession numbers, as well as an interactive map showing strain isolation sites. The main entry point to the database is a term search (gene name, product, etc.), which returns a list of CLOGs and individual genes. Each CLOG benefits from a rich functional annotation including EggNOG, EC/K numbers, GO terms, TIGR Roles, custom-designed Cyanorak Roles as well as several protein motif predictions. Cyanorak also displays a phyletic profile, indicating the genotype and pigment type for each CLOG, and a genome viewer (JBrowse) to visualize additional data on each genome such as predicted operons, genomic islands or transcriptomic data, when available. This information system also includes a BLAST search tool, a comparative view of genomic context and various data export options. Altogether, Cyanorak v2.1 constitutes an invaluable, scalable tool for comparative genomics of ecologically relevant marine microorganisms.

Highlights

  • The regular decrease in sequencing costs associated with the rapid development of Next Generation Sequencing (NGS) technologies has led to the multiplication of microbial genomes [1,2], making possible extensive comparative genomics studies

  • Built from 97 picocyanobacterial genomes, including 43 Prochlorococcus and 54 Synechococcus/Cyanobium, which are representative of the wide genetic and pigment diversity existing within these genera (Figure 1), Cyanorak v2.1 encompasses 252,176 genes that were split into 25,834 clusters of likely orthologous groups (CLOGs)

  • A plot of the distribution of the number of sequences per CLOG shows, as expected, that the most frequent categories are CLOGs with one sequence, i.e. unique genes (15,283 CLOGs), and CLOGs with few (2 to 5) members (Supplementary Figure S1). Although most of these CLOGs (e.g. 91% of unique genes) are annotated as ‘hypothetical’ or ‘conserved hypothetical’ proteins, a number of them display a more precise functional annotation, since they share some similarity with genes or domains of known function. Among the most abundant are glycosyltransferases, restriction-modification system proteins, integrases, transposases, methyltransferases, NAD-dependent epimerases/dehydratases and tetratricopeptide repeat (TPR) family proteins


Summary

Introduction

The regular decrease in sequencing costs associated with the rapid development of Next Generation Sequencing (NGS) technologies has led to the multiplication of microbial genomes [1,2], making possible extensive comparative genomics studies. A smart alternative to curating genomes one by one is to curate several phylogenetically related genomes at a time, after gathering sequences into Clusters of Likely Orthologous Genes (CLOGs), i.e. genes that exhibit reciprocal best hits to one another and are hypothesized to have the same function in the different members of the dataset [5,6]. This strategy, notably used in the NCBI’s ‘prokaryotic genome annotation pipeline’ [7] for annotating new genomes or re-annotating older genomes before inclusion in the RefSeq database, allows rich functional annotations made at the CLOG level to be propagated to all proteins composing the CLOG, and makes it possible to unify and standardize these annotations across all sequenced strains.
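To make the reciprocal-best-hit idea and the propagation of CLOG-level annotations more concrete, the following Python sketch may help. It is only an illustration under stated assumptions, not the Cyanorak pipeline: the gene identifiers of the form "genome|gene", the bit-score input tuples, the single-linkage grouping of reciprocal pairs and all function names are hypothetical choices made for this example.

```python
from collections import defaultdict


def genome_of(gene_id):
    # Assumption: identifiers look like "genome|gene" (hypothetical format).
    return gene_id.split("|", 1)[0]


def best_hits(hits):
    """Return the top-scoring subject for each (query, target genome) pair.

    `hits` is an iterable of (query, subject, score) tuples, e.g. parsed
    from a similarity search. Hits within the same genome are ignored.
    """
    best = {}
    for query, subject, score in hits:
        if genome_of(query) == genome_of(subject):
            continue
        key = (query, genome_of(subject))
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}


def reciprocal_best_hits(hits):
    """Return gene pairs that are each other's best hit across two genomes."""
    best = best_hits(hits)
    pairs = set()
    for (query, _), subject in best.items():
        if best.get((subject, genome_of(query))) == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


def cluster_pairs(pairs):
    """Group reciprocal pairs into clusters (single linkage, an assumption)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())


def propagate_annotation(cluster, annotation):
    """Assign one cluster-level annotation to every gene in the cluster."""
    return {gene: annotation for gene in cluster}
```

In this sketch, a gene whose best hit in another genome is not returned reciprocally stays outside the cluster, and a single curated annotation attached to the cluster is copied to every member, which is the standardization benefit described above.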

