Abstract

Bacillus pumilus group strains have been studied due to their agronomic, biotechnological, or pharmaceutical potential. Classifying strains of this taxonomic group at the species level is challenging because the group comprises seven species that share over 99.5% 16S rRNA gene identity. In this study, a whole-genome in silico approach was first used to accurately demarcate B. pumilus group strains, as a case of highly phylogenetically related taxa, at the species level. To achieve this, and consequently to validate or correct the taxonomic identities of genomes in public databases, average nucleotide identity correlation, core-based phylogenomic, and gene function repertoire analyses were performed. More than 50% of these genomes were found to be misclassified. Hierarchical clustering of gene functional repertoires was also used to infer ecotypes among B. pumilus group species. Furthermore, for the first time, the machine-learning algorithm Random Forest was used to rank genes in order of their importance for species classification. We found that ybbP, a gene involved in the synthesis of cyclic di-AMP, was the most important gene for accurately predicting species identity among B. pumilus group strains. Finally, principal component analysis was used to classify strains based on the distances between their ybbP genes. The methodologies described could be applied more broadly to identify other highly phylogenetically related species in metagenomic or epidemiological assessments.

Highlights

  • The highly phylogenetically related B. pumilus group is composed of B. pumilus, B. safensis, B. altitudinis, B. stratosphericus, B. aerophilus, B. xiamenensis, and B. invictae, species that share more than 99% 16S rRNA gene sequence similarity

  • Circumscription of Bacillus pumilus group strains into species using whole-genome data

  • To resolve the taxonomic identity of strains of the Bacillus pumilus group, a pipeline to circumscribe them at species level was employed

Introduction

The highly phylogenetically related B. pumilus group is composed of B. pumilus, B. safensis, B. altitudinis, B. stratosphericus, B. aerophilus, B. xiamenensis, and B. invictae, species that share more than 99% 16S rRNA gene sequence similarity. An increasing number of genome sequences from B. pumilus group strains are becoming available, since these bacteria have a wide range of agronomic, biotechnological, and pharmaceutical uses [1,2,3,4,5,6,7,8,9,10,11]. Bioinformatics tools have been developed to use these data in an attempt to circumscribe bacterial species. These include in silico DNA-DNA hybridization (is-DDH), average nucleotide identity (ANI) among shared genes, tetranucleotide frequency correlation coefficients, and multilocus sequence analysis (MLSA) using the core genome of a genus [14,15]. While there are well-curated genomic databases [16], many genomes deposited in public databases are misnamed, mainly because of the common practice of identifying strains using 16S rRNA gene sequence data alone [17].
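To illustrate how pairwise ANI values can be turned into candidate species groups, the following is a minimal sketch and not the pipeline used in this study. It assumes that pairwise ANI percentages have already been computed by an external tool, and it uses a 95% cutoff, a commonly cited species boundary that is an assumption here rather than a value taken from the text. The function name `cluster_by_ani`, the strain labels, and the ANI values are all hypothetical.

```python
from itertools import combinations


def cluster_by_ani(names, ani, threshold=95.0):
    """Group genomes by single linkage: two genomes are joined when their
    pairwise ANI (in percent) is at or above `threshold`.

    The 95.0 default is an assumed, commonly used species cutoff.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        # Accept either key order; missing pairs are simply skipped.
        value = ani.get((a, b), ani.get((b, a)))
        if value is not None and value >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy, made-up ANI values for four hypothetical strains.
names = ["strain_A", "strain_B", "strain_C", "strain_D"]
ani = {
    ("strain_A", "strain_B"): 98.7,
    ("strain_A", "strain_C"): 91.2,
    ("strain_A", "strain_D"): 90.8,
    ("strain_B", "strain_C"): 91.0,
    ("strain_B", "strain_D"): 90.5,
    ("strain_C", "strain_D"): 97.9,
}
clusters = cluster_by_ani(names, ani)
print(clusters)
```

Single linkage is the simplest choice and can chain borderline genomes into one group, so in practice the resulting clusters would need to be checked against type-strain genomes and a phylogenomic tree before any label is reassigned.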
