Abstract

Summary: MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig’s taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2–18× faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing taxonomic assignments.

Availability and implementation: MMseqs2 taxonomy is part of the MMseqs2 free open-source software package available for Linux, macOS and Windows at https://mmseqs.com.

Supplementary information: Supplementary data are available at Bioinformatics online.
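The weighted voting step mentioned in the Summary can be illustrated with a minimal Python sketch. It is not the MMseqs2 implementation: the function name `vote_contig_label`, the `majority` threshold, the generic per-fragment weights and the rule of picking the deepest taxon with enough support are all assumptions made for illustration. Lineages are given as root-to-leaf tuples of taxon names, which are assumed to be unique.

```python
from collections import defaultdict


def vote_contig_label(fragment_votes, majority=0.5):
    """Pick a contig label from weighted fragment labels (illustrative only).

    fragment_votes: list of (lineage, weight) pairs, where lineage is a
    root-to-leaf tuple of taxon names and weight is a positive number.
    Returns the most specific taxon whose accumulated weight, counting every
    fragment labelled at or below it, reaches `majority` of the total weight.
    """
    total = sum(weight for _, weight in fragment_votes)
    if total <= 0:
        return None

    support = defaultdict(float)  # taxon -> summed weight of supporting fragments
    depth = {}                    # taxon -> distance from the root
    for lineage, weight in fragment_votes:
        for level, taxon in enumerate(lineage):
            support[taxon] += weight
            depth[taxon] = level

    candidates = [t for t, s in support.items() if s / total >= majority]
    if not candidates:
        return None
    # Prefer the deepest candidate; break ties by higher support.
    return max(candidates, key=lambda t: (depth[t], support[t]))


votes = [
    (("root", "Bacteria", "Proteobacteria", "Escherichia"), 4.0),
    (("root", "Bacteria", "Proteobacteria", "Salmonella"), 2.0),
    (("root", "Bacteria", "Firmicutes"), 1.0),
]
print(vote_contig_label(votes, majority=0.5))
print(vote_contig_label(votes, majority=0.6))
```

In this toy input, raising the hypothetical `majority` threshold pushes the label from a genus up to a phylum, because no single genus collects enough of the total weight.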

Highlights

  • Metagenomic studies shine a light on previously unstudied parts of the tree of life

  • Despite its advantage over existing methods, CAT has limitations: (1) Prodigal was designed for prokaryotes and not eukaryotes [13]; (2) Prodigal runs single-threaded, limiting applicability to metagenomics; (3) CAT’s r parameter, which sets the cut-off score below each open reading frame’s (ORF’s) top hit above which hits are included in the ORF’s lowest common ancestor (LCA) computation, has to be tuned

  • All 57 SAR RefSeq assemblies and their taxonomic labels were downloaded from NCBI in 08/2020
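The role of a score cut-off such as CAT’s r parameter in an LCA computation can be sketched as follows. This is a hedged illustration, not CAT’s code: the function names are hypothetical, and the assumption that r is a percentage range below the top hit’s bit score is made only for this example.

```python
def select_hits(hits, r=10.0):
    """Keep hits scoring within r percent of the top hit (assumed semantics).

    hits: list of (lineage, bit_score) pairs; lineage is a root-to-leaf tuple.
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    cutoff = top * (1.0 - r / 100.0)
    return [(lineage, score) for lineage, score in hits if score >= cutoff]


def lowest_common_ancestor(lineages):
    """Return the last taxon of the longest prefix shared by all lineages."""
    if not lineages:
        return None
    shared = []
    for taxa_at_level in zip(*lineages):
        if all(taxon == taxa_at_level[0] for taxon in taxa_at_level):
            shared.append(taxa_at_level[0])
        else:
            break
    return shared[-1] if shared else None


orf_hits = [
    (("root", "Bacteria", "Proteobacteria", "Escherichia"), 200.0),
    (("root", "Bacteria", "Proteobacteria", "Salmonella"), 185.0),
    (("root", "Bacteria", "Firmicutes", "Bacillus"), 120.0),
]
for r in (10.0, 50.0):
    kept = [lineage for lineage, _ in select_hits(orf_hits, r)]
    print(r, lowest_common_ancestor(kept))
```

A wider range admits more distant hits and therefore yields a less specific LCA, which is why such a parameter needs tuning per data set.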


Summary

INTRODUCTION

Metagenomic studies shine a light on previously unstudied parts of the tree of life. However, unraveling taxonomic composition accurately and quickly remains a challenge. CAT [12] was developed as a tool for taxonomic annotation of contigs based on protein homologies to a reference database. It combines Prodigal [7] for predicting open reading frames (ORFs), DIAMOND [3] to search with the translated ORFs, and logic to aggregate individual ORF annotations. We present MMseqs2 taxonomy, a novel protein-search-based tool for taxonomy assignment to contigs. It overcomes CAT’s limitations by extracting all possible protein fragments, covering the coding repertoire of all domains of life. The hits for the a2bLCA computation are determined automatically, saving the need to tune an equivalent of CAT’s r parameter. It outperforms CAT on bacterial and eukaryotic data sets.
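One plausible reading of "extracting all possible protein fragments" is a six-frame translation that keeps every stretch between stop codons above a minimum length, with no gene model involved. The Python sketch below illustrates that idea only; it is not the MMseqs2 implementation, and the function names, the use of the standard genetic code and the `min_length` default are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed for this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def extract_fragments(contig, min_length=30):
    """Collect stop-to-stop protein fragments from all six reading frames.

    Returns a list of (strand, frame, fragment) tuples. `min_length` is a
    hypothetical filter on fragment length in amino acids.
    """
    contig = contig.upper()
    fragments = []
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            protein = translate(seq[frame:])
            for piece in protein.split("*"):
                if len(piece) >= min_length:
                    fragments.append((strand, frame, piece))
    return fragments


toy_contig = "ATGAAATTTTAGGGGCCC"
for strand, frame, fragment in extract_fragments(toy_contig, min_length=3):
    print(strand, frame, fragment)
```

Because the sketch does not rely on start codons or organism-specific gene models, it applies equally to prokaryotic and eukaryotic contigs, at the cost of producing many fragments that a later filtering step would have to discard.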

