Abstract

BackgroundTaxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis. Paired-end read merging is routinely used to capture the entire amplicon sequence when the read ends overlap. However, the exclusion of unmerged reads from further analysis can result in underestimating the diversity in the sequenced microbial community and is influenced by bioinformatic processes such as read trimming and the choice of reference database. A potential solution to overcome this is to concatenate (join) reads that do not overlap and keep them for taxonomic classification. The use of concatenated reads can outperform taxonomic recovery from single-end reads, but it remains unclear how their performance compares to merged reads. Using various sequenced mock communities with different amplicons, read length, read depth, taxonomic composition, and sequence quality, we tested how merging and concatenating reads performed for genus recall and precision in bioinformatic pipelines combining different parameters for read trimming and taxonomic classification using different reference databases.ResultsThe addition of concatenated reads to merged reads always increased pipeline performance. The top two performing pipelines both included read concatenation, with variable strengths depending on the mock community. The pipeline that combined merged and concatenated reads that were quality-trimmed performed best for mock communities with larger amplicons and higher average quality sequences. The pipeline that used length-trimmed concatenated reads outperformed quality trimming in mock communities with lower quality sequences but lost a significant amount of input sequences for taxonomic classification during processing. Genus level classification was more accurate using the SILVA reference database compared to Greengenes.ConclusionsMerged sequences with the addition of concatenated sequences that were unable to be merged increased performance of taxonomic classifications. This was especially beneficial in mock communities with larger amplicons. We have shown for the first time, using an in-depth comparison of pipelines containing merged vs concatenated reads combined with different trimming parameters and reference databases, the potential advantages of concatenating sequences in improving resolution in microbiome investigations.

Highlights

  • Taxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis

  • Each pipeline generated a list of amplicon sequence variants (ASVs) for each mock, which were taxonomically identified to the bacterial genus level

  • Using the list of genera detected from each pipeline, the presence and absence of each mock community bacterial member were used to determine the number of genera that were true positives (TPs), false positives (FPs), and false negatives (FNs)

Read more

Summary

Introduction

Taxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis. Trimming 16S rRNA gene sequences by length might be preferable over quality trimming for unbiased clustering operational taxonomic units (OTUs) [17], and it reduces the number of amplicon sequence variants (ASVs) by eliminating length variation. In both cases though, trimming might cause a loss of informative base pairs necessary for merging paired reads or important for distinguishing between closely related taxa

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call