CDSnake: Snakemake pipeline for retrieval of annotated OTUs from paired-end reads using CD-HIT utilities

Yulia Kondratenko, Anton Korobeynikov, Alla Lapidus

doi:10.1186/s12859-020-03591-6

Open Access

https://doi.org/10.1186/s12859-020-03591-6

Journal: BMC Bioinformatics
Publication Date: Jul 1, 2020
Citations: 2
License type: open-access

Affiliation: St Petersburg University

Abstract

Background: Illumina paired-end reads are often used for 16S analysis in metagenomic studies. Since the DNA fragment size is usually smaller than the sum of the lengths of the paired reads, the reads can be merged for downstream analysis. Despite the development of several tools for merging paired-end reads, poor quality at the 3′ ends within the overlapping region prevents accurate merging of a significant portion of read pairs. CD-HIT-OTU-MiSeq was recently presented as a new approach to 16S analysis using paired-end reads; it avoids read merging entirely by clustering the paired reads separately. CD-HIT-OTU-MiSeq is a set of tools intended to be launched in succession by auxiliary shell scripts. This launch mode is not suitable for processing the large amounts of data generated in modern omics experiments. To solve this issue we created CDSnake, a Snakemake pipeline that uses CD-HIT tools to simplify the consecutive launch of the CD-HIT-OTU-MiSeq tools for complete processing of paired-end reads in metagenomic studies. The pipeline makes 16S analysis easier through a one-command launch and helps to yield reproducible results.

Results: We benchmarked our pipeline against two commonly used pipelines for OTU retrieval, DADA2 and deblur, both incorporated into QIIME2, a popular workflow for microbiome analysis. Three mock datasets with highly overlapping paired-end 2 × 250 bp reads were used for benchmarking: Balanced, HMP, and Extreme. CDSnake output fewer OTUs than DADA2 and deblur. However, on the Balanced and HMP datasets the number of OTUs output by CDSnake was closer to the real number of strains used to generate the mock community than the numbers output by DADA2 and deblur. Though generally slower than the other pipelines, CDSnake output higher total counts, preserving more information from the raw data. Inheriting these properties from the original CD-HIT-OTU-MiSeq utilities, CDSnake makes them handier to use thanks to simple scalability, easier automated runs, and other Snakemake benefits.

Conclusions: We developed a Snakemake pipeline for OTU-MiSeq utilities, which simplifies and automates data analysis. Benchmarking showed that this approach can outperform popular tools under certain conditions.

Highlights

Illumina paired-end reads are often used for 16S analysis in metagenomic studies
We developed a Snakemake pipeline for OTU-MiSeq utilities, which simplifies and automates data analysis
Despite the development of several tools for merging paired-end reads [1, 2], poor-quality sequence at the 3′ ends of both paired-end reads in the overlapping region prevents the correct assembly of a significant portion of read pairs

Summary

Results

We benchmarked our pipeline against two commonly used pipelines for OTU retrieval incorporated into QIIME2 [8], a popular workflow for microbiome analysis. The exception was the HMP dataset, which had the lowest quality, where deblur output fewer OTUs than CDSnake (53 vs 59). At this level of errors in the input data, clustering of reads by the CD-HIT utilities output more OTUs than deblur did after dropping erroneous sequences. On the Extreme dataset CDSnake, as expected, performed worse than DADA2 and deblur, since the clustering algorithm cannot separate sequencing errors from the actual 1-nt differences present between strains in this community. Especially when too many features are output for one source strain, this heterogeneity can be an artefact of sequencing errors or of incorrect work of error-correction algorithms, if they were applied. Considering such complex mapping of annotated features to known annotations, we provide a second measure of annotation correctness: the number of microorganisms. This can be explained by the use of additional Python components that are necessary to run Snakemake pipelines.
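
The point about 1-nt differences on the Extreme dataset can be illustrated with a toy example. The sketch below is not the CD-HIT algorithm: it is a simplified greedy clustering on equal-length sequences, with a Hamming-based identity and a 0.97 threshold chosen only for illustration. The sequences and the names `identity`, `greedy_cluster`, `strain_a`, `strain_b`, and `read_err` are hypothetical. It shows that a strain differing by one nucleotide and a read carrying one sequencing error look the same to a threshold-based clustering step.

```python
# Illustrative only: greedy identity-threshold clustering on equal-length
# sequences. This is NOT the CD-HIT algorithm; the 0.97 threshold and the
# Hamming-based identity are assumptions made for this sketch.


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(seqs: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Put each sequence into the first cluster whose representative
    (its first member) is at least `threshold` identical to it."""
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing representative was similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


# Hypothetical 100-nt sequences.
strain_a = "ACGT" * 25
strain_b = "T" + strain_a[1:]                    # a real strain, 1 nt different
read_err = strain_a[:50] + "A" + strain_a[51:]   # strain_a with 1 sequencing error

reads = [strain_a, strain_b, read_err]
merged = greedy_cluster(reads)                   # everything falls into one cluster
split = greedy_cluster(reads, threshold=1.0)     # the error becomes its own cluster
print(len(merged), len(split))
```

With the looser threshold the genuine 1-nt variant is absorbed into the same cluster as the error-containing read; with exact matching the sequencing error is reported as if it were a separate strain. Neither setting separates the two cases, which is the limitation the summary describes for clustering on the Extreme dataset.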

Background

Conclusion