Abstract

Motivation: Previously we presented swarm, an open-source amplicon clustering programme that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here, we present swarm v3 to address issues of contemporary datasets that are growing towards terabyte sizes.

Results: When compared with previous swarm versions, swarm v3 has modernized C++ source code, a memory footprint reduced by up to 50%, and optimized CPU usage and multithreading (more than 7 times faster with default parameters), and it has been extensively tested for its robustness and logic.

Availability and implementation: Source code and binaries are available at https://github.com/torognes/swarm.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • In emerging planetary biology, large-scale amplicon sequencing datasets are used to unravel global ecological and evolutionary patterns within and across biomes and biota (de Vargas et al., 2015; Mahé et al., 2017; Giner et al., 2020)

  • Motivation: Previously we presented swarm, an open-source amplicon clustering program that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds

  • Availability: Source code and binaries are available at https://github.com/torognes/swarm

  • Contact: frederic.mahe@cirad.fr

  • Supplementary information: Supplementary data are available at Bioinformatics online

Summary

Introduction

Large-scale amplicon sequencing datasets are used to unravel global ecological and evolutionary patterns within and across biomes and biota (de Vargas et al., 2015; Mahé et al., 2017; Giner et al., 2020). A critical bioinformatics step in the handling of these massive metabarcoding datasets is to cluster the sequencing reads into operational taxonomic units (OTUs). Swarm v1 (Mahé et al., 2014) was introduced as a novel approach to cluster amplicons into OTUs, inspired by previous single-linkage methods such as DOTUR (Schloss & Handelsman, 2005). The key underlying idea of swarm was to use a local, iterative, single-linkage clustering process to group closely related sequences (by default, sequences with one difference in their nucleotide sequences, i.e. d = 1). The code could only be executed on GNU/Linux and macOS on x86-64 CPUs. Although swarm v2 was multithreaded and fast, its time and memory requirements could become a limiting factor on very large current and future datasets, especially as amplicon sequences become longer.
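The idea of local, iterative, single-linkage clustering with d = 1 can be illustrated with a toy sketch. This is not swarm's implementation: the function names (`within_one_difference`, `cluster`), the abundance-ordered choice of seeds, and the quadratic all-against-all neighbour search are assumptions made for illustration only. A cluster grows by repeatedly adding any unassigned sequence that lies within one difference (substitution, insertion, or deletion) of a sequence already in the cluster.

```python
from collections import deque


def within_one_difference(a: str, b: str) -> bool:
    """Return True if a and b differ by at most one substitution, insertion, or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    # Advance past the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after the mismatch must be identical.
        return a[i + 1:] == b[i + 1:]
    # One insertion in b: dropping b[i] must yield a.
    return a[i:] == b[i + 1:]


def cluster(abundances: dict[str, int]) -> list[list[str]]:
    """Toy iterative single-linkage clustering with d = 1 (quadratic time)."""
    # Assumption for this sketch: the most abundant sequences seed clusters first,
    # with ties broken alphabetically so the result is deterministic.
    pool = sorted(abundances, key=lambda s: (-abundances[s], s))
    unassigned = set(pool)
    clusters = []
    for seed in pool:
        if seed not in unassigned:
            continue  # already absorbed by an earlier cluster
        unassigned.remove(seed)
        members = [seed]
        queue = deque([seed])
        # Local, iterative growth: each new member is itself explored for neighbours.
        while queue:
            current = queue.popleft()
            neighbours = sorted(
                s for s in unassigned if within_one_difference(current, s)
            )
            for s in neighbours:
                unassigned.remove(s)
                members.append(s)
                queue.append(s)
        clusters.append(members)
    return clusters


if __name__ == "__main__":
    toy = {"ACGT": 10, "ACGA": 3, "ACCA": 1, "TTTT": 5, "TTT": 2}
    for otu in cluster(toy):
        print(otu)
```

In this toy input, `ACCA` is two differences away from the seed `ACGT` but joins its cluster through the intermediate `ACGA`. This chaining is what makes the process single-linkage and free of a global clustering threshold.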

