Abstract
Summary: MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig’s taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2–18× faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing taxonomic assignments.
Availability and implementation: MMseqs2 taxonomy is part of the MMseqs2 free open-source software package available for Linux, macOS and Windows at https://mmseqs.com.
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
Metagenomic studies shine a light on previously unstudied parts of the tree of life
Despite its advantage over existing methods, CAT has limitations: (1) Prodigal was designed for prokaryotes and not eukaryotes [13]; (2) Prodigal runs single-threaded, limiting applicability to metagenomics; (3) CAT’s r parameter sets how far below the top-hit score of each open reading frame (ORF) a hit may fall and still be included in that ORF’s lowest common ancestor (LCA) computation.
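To make point (3) concrete, here is a minimal Python sketch of how a top-hit score range could feed an LCA computation. It assumes that r is a percentage of the top-hit bit score and that each hit carries a root-to-leaf lineage; the function names, lineages and scores are hypothetical illustrations and are not taken from CAT's code.

```python
def hits_in_range(hits, r):
    """Keep hits whose score is within r percent of the top-hit score.

    hits: list of (lineage, bit_score) pairs, lineage being a root-to-leaf tuple.
    r:    assumed here to be a percentage of the top-hit bit score.
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    cutoff = top * (1 - r / 100)
    return [hit for hit in hits if hit[1] >= cutoff]


def lca(lineages):
    """Return the longest shared prefix of the lineages (their LCA path)."""
    common = []
    if not lineages:
        return common
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            common.append(taxa[0])
        else:
            break
    return common


# Hypothetical hits for one ORF.
orf_hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 200),
    (("Bacteria", "Proteobacteria", "Salmonella"), 190),
    (("Bacteria", "Firmicutes", "Bacillus"), 100),
]

narrow = lca([lineage for lineage, _ in hits_in_range(orf_hits, r=10)])
wide = lca([lineage for lineage, _ in hits_in_range(orf_hits, r=60)])
print(narrow, wide)
```

In this toy example a small r keeps only the two closely scoring hits and yields a more specific label, while a large r admits the distant hit and pushes the label toward the root, which is why the choice of r matters.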
All 57 SAR RefSeq assemblies and their taxonomic labels were downloaded from NCBI in 08/2020
Summary
Metagenomic studies shine a light on previously unstudied parts of the tree of life. Unraveling taxonomic composition accurately and quickly remains a challenge. CAT [12] is a tool for taxonomic annotation of contigs based on protein homologies to a reference database. It combines Prodigal [7] for predicting open reading frames (ORFs), DIAMOND [3] to search with the translated ORFs, and logic to aggregate individual ORF annotations. We present MMseqs2 taxonomy, a novel protein-search-based tool for taxonomy assignment to contigs. It overcomes CAT's limitations by extracting all possible protein fragments, covering the coding repertoire of all domains of life. The hits for the a2bLCA computation are determined automatically, saving the need to tune an equivalent of CAT’s r parameter. It outperforms CAT on bacterial and eukaryotic data sets.
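The idea of extracting all possible protein fragments, rather than predicting genes, can be illustrated with a short Python sketch. It assumes a six-frame translation with the standard genetic code and splits each frame at stop codons; the function names and the min_len threshold are hypothetical, and this is not the MMseqs2 implementation.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def protein_fragments(contig, min_len=30):
    """Return all stop-to-stop fragments of at least min_len residues
    from the six reading frames of a contig (min_len is a hypothetical filter)."""
    contig = contig.upper()
    reverse_complement = contig.translate(COMPLEMENT)[::-1]
    fragments = []
    for strand in (contig, reverse_complement):
        for frame in range(3):
            for piece in translate(strand[frame:]).split("*"):
                if len(piece) >= min_len:
                    fragments.append(piece)
    return fragments


# Toy contig; real contigs are far longer and would use a larger min_len.
toy = protein_fragments("ATGGCCTAAGGG", min_len=2)
print(toy)
```

Because no start codon or gene model is required, a sketch like this makes no prokaryote-specific assumptions; the cost is many spurious fragments, which is why a fast filtering step before the search matters.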