Abstract

Determining the taxonomic affiliation of sequences assembled from metagenomes remains a major bottleneck that affects research across the fields of environmental, clinical and evolutionary microbiology. Here, we introduce MyTaxa, a homology-based bioinformatics framework to classify metagenomic and genomic sequences with unprecedented accuracy. The distinguishing aspect of MyTaxa is that it employs all genes present in an unknown sequence as classifiers, weighting each gene based on its (predetermined) classifying power at a given taxonomic level and frequency of horizontal gene transfer. MyTaxa also implements a novel classification scheme based on the genome-aggregate average amino acid identity concept to determine the degree of novelty of sequences representing uncharacterized taxa, i.e. whether they represent novel species, genera or phyla. Application of MyTaxa on in silico generated (mock) and real metagenomes of varied read length (100–2000 bp) revealed that it correctly classified at least 5% more sequences than any other tool. The analysis also showed that ∼10% of the assembled sequences from human gut metagenomes represent novel species with no sequenced representatives, several of which were highly abundant in situ such as members of the Prevotella genus. Thus, MyTaxa can find several important applications in microbial identification and diversity studies.
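The gene-weighting idea described above can be illustrated with a minimal sketch. This is not MyTaxa's actual scoring scheme, which the abstract does not specify. The sketch assumes that each gene on a query sequence has already been assigned a best-matching taxon and a precomputed weight, and it uses a simple weighted vote as a stand-in. The function name `classify_by_weighted_vote`, the `min_share` threshold and the example weights are all hypothetical.

```python
from collections import defaultdict


def classify_by_weighted_vote(gene_hits, min_share=0.5):
    """Pick the taxon supported by the largest share of total gene weight.

    gene_hits: iterable of (taxon, weight) pairs, one per gene on the query
               sequence; the weights are assumed to be precomputed elsewhere.
    min_share: hypothetical cutoff; below it the sequence is left unclassified.
    Returns (taxon or None, share of total weight held by the top taxon).
    """
    totals = defaultdict(float)
    for taxon, weight in gene_hits:
        totals[taxon] += weight

    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0

    best = max(totals, key=totals.get)
    share = totals[best] / total_weight
    return (best if share >= min_share else None), share


if __name__ == "__main__":
    # Illustrative values only: four genes on one contig with made-up weights.
    hits = [
        ("Prevotella", 0.9),
        ("Prevotella", 0.6),
        ("Bacteroides", 0.3),
        ("Escherichia", 0.2),
    ]
    print(classify_by_weighted_vote(hits))
```

In this toy scheme, a gene with low classifying power or a high frequency of horizontal transfer would carry a small weight, so a single misleading hit cannot outvote several informative genes.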

Highlights

  • Culture-independent whole-genome shotgun (WGS) DNA sequencing has revolutionized the study of the diversity and ecology of microbial communities during the last decade [1,2]

  • Users can start MyTaxa analysis by supplying two files: (i) a standard GFF file containing the genes predicted on the query sequences by gene prediction tools such as metaGeneMark, Prodigal or FragGeneScan [20,21,22]; and (ii) a tabular output file from the similarity search of the predicted gene sequences against the sequences used to construct the database of gene weights or another database that includes the GI accession number of the matching gene

  • The current taxonomic system, especially the ranks higher than the species rank, is primarily based on the grouping patterns of the 16S rRNA gene phylogeny but no standards exist on the degree of genetic relatedness of the organisms grouped at different ranks
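The two input files mentioned in the highlights above (a GFF file of predicted genes and a tabular similarity-search output) could be paired roughly as follows. This sketch rests on several assumptions not stated in the source: the tabular file uses the common 12-column layout with the query ID first, the subject ID second and the bit score last; the GFF `ID` attribute matches the query ID used in the search; and only the best hit per gene is kept. All function names are hypothetical, and this is not MyTaxa's own parser.

```python
import csv


def read_gff_genes(handle):
    """Map gene ID -> contig ID from GFF lines (nine tab-separated columns)."""
    genes = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed records
        # Column 9 holds key=value attributes separated by semicolons.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        gene_id = attrs.get("ID")
        if gene_id:
            genes[gene_id] = cols[0]
    return genes


def best_hits(handle):
    """Keep the highest-bit-score hit per query gene (assumed 12-column table)."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # skip comments and short rows
        query, subject, bits = row[0], row[1], float(row[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def hits_per_contig(genes, best):
    """Group (gene ID, matching subject ID) pairs by the contig carrying the gene."""
    contigs = {}
    for gene_id, contig in genes.items():
        if gene_id in best:
            contigs.setdefault(contig, []).append((gene_id, best[gene_id][0]))
    return contigs
```

A caller would open both files, pass the handles to `read_gff_genes` and `best_hits`, and then feed the grouped hits to whatever classification step follows. The subject ID in each pair is the identifier of the matching gene, which in the workflow described above would carry the GI accession number.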


Summary

Introduction

Culture-independent whole-genome shotgun (WGS) DNA sequencing has revolutionized the study of the diversity and ecology of microbial communities during the last decade [1,2]. However, the taxonomic identity of most sequences assembled from a metagenomic dataset frequently remains elusive, and exchanging information about an organism or a DNA sequence is challenging when no name is available for it. This limitation severely impedes communication among scientists and scientific discovery across the fields of ecology, systematics, evolution, engineering and medicine. The 16S rRNA gene provides limited resolution at the species level, which represents a major limitation for epidemiological and micro-diversity studies [11]. To overcome these limitations, whole-genome-based approaches and tools, comparable to those already available for the 16S rRNA gene, are needed. These tools must scale with the increasingly large volume of sequence data produced by new sequencers and be able to detect and categorize novel taxa, e.g. to determine whether the taxa represent novel species or genera.

