Abstract

The recent advent of DNA sequencing technologies facilitates the use of genome sequence data, which provide a means for more informative and precise classification and identification of members of the Bacteria and Archaea. Because the current species definition is based on the comparison of genome sequences between type and other strains in a given species, building a genome database with correct taxonomic information is of paramount importance, both for exploring prokaryotic diversity and discovering novel species and for routine identification. Here we introduce an integrated database, called EzBioCloud, that holds the taxonomic hierarchy of the Bacteria and Archaea, represented by quality-controlled 16S rRNA gene and genome sequences. Whole-genome assemblies in the NCBI Assembly Database were screened for low quality and subjected to a composite bioinformatics identification pipeline that employs gene-based searches followed by the calculation of average nucleotide identity. As a result, the database comprises 61 700 species/phylotypes, including 13 132 with validly published names, and 62 362 whole-genome assemblies that were identified taxonomically at the genus, species and subspecies levels. Genomic properties, such as genome size and DNA G+C content, and the occurrence in human microbiome data were calculated for each taxon at the genus level or higher. This united database of taxonomy, 16S rRNA gene and genome sequences, with accompanying bioinformatics tools, should accelerate genome-based classification and identification of members of the Bacteria and Archaea. The database and related search tools are available at www.ezbiocloud.net/.

Highlights

  • One of the goals of the modern taxonomy of the Bacteria and Archaea is the objective definition of species, insofar as it applies to classification and identification

  • For cases in which multiple sequences were available for a type strain, the sequence extracted from its whole-genome assembly (WGA) was selected

  • Taxa without their type or representative 16S rRNA gene sequences were not included in the database

Summary

Introduction

One of the goals of the modern taxonomy of the Bacteria and Archaea is the objective definition of species, insofar as it applies to classification and identification. Two types of databases were used: (i) the 16S rRNA gene sequence database that is used in the ‘Identify’ engine, and (ii) the Reference Genome Database (RefGD). The latter was compiled to hold tetranucleotide compositions [17] and the gyrB and recA sequences from all available genome sequences of type or representative strains.
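
To make the term concrete, a tetranucleotide composition can be thought of as the relative frequency of each of the 256 possible 4-mers in a genome sequence. The following is a minimal sketch under stated assumptions, not the procedure used to build RefGD: the function name `tetranucleotide_composition` is hypothetical, only one strand is counted, and windows containing ambiguous bases (such as N) are skipped.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def tetranucleotide_composition(sequence):
    """Return the relative frequency of each tetranucleotide in `sequence`.

    Assumptions (illustrative only): a single strand is counted, and any
    4-base window containing a character other than A, C, G or T is skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    # Slide a window of width 4 along the sequence, one base at a time.
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if all(base in BASES for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous tetranucleotide")
    # Tetranucleotides that never occur get a frequency of 0.0.
    return {t: counts[t] / total for t in TETRAMERS}


if __name__ == "__main__":
    profile = tetranucleotide_composition("ACGTACGTNACGTTTGCA")
    # Show the most frequent tetranucleotides in this toy sequence.
    for kmer, freq in sorted(profile.items(), key=lambda kv: -kv[1])[:3]:
        print(kmer, round(freq, 3))
```

The resulting 256-element profile is a fixed-length summary of a genome of any size, which is what makes such compositions convenient to store alongside marker gene sequences in a reference database.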
