Biodiversity Soup II: A bulk‐sample metabarcoding pipeline emphasizing error reduction

Chunyan Yang, Kristine Bohmann, Nathan Wales, Wang Cai, Xiaoyang Wang, Shyam Gopalakrishnan, Douglas W Yu, Zhaoli Ding

doi:10.1111/2041-210x.13602

Abstract

Despite widespread recognition of its great promise to aid decision‐making in environmental management, the applied use of metabarcoding requires improvements to reduce the multiple errors that arise during PCR amplification, sequencing and library generation. We present a co‐designed wet‐lab and bioinformatic workflow for metabarcoding bulk samples that removes both false‐positive (tag jumps, chimeras, erroneous sequences) and false‐negative (‘dropout’) errors. However, we find that it is not possible to recover relative‐abundance information from amplicon data, due to persistent species‐specific biases. To present and validate our workflow, we created eight mock arthropod soups, all containing the same 248 arthropod morphospecies but differing in absolute and relative DNA concentrations, and we ran them under five different PCR conditions. Our pipeline includes qPCR‐optimized PCR annealing temperature and cycle number, twin‐tagging, multiple independent PCR replicates per sample, and negative and positive controls. In the bioinformatic portion, we introduce Begum, which is a new version of DAMe (Zepeda‐Mendoza et al., 2016. BMC Res. Notes 9:255) that ignores heterogeneity spacers, allows primer mismatches when demultiplexing samples and is more efficient. Like DAMe, Begum removes tag‐jumped reads and removes sequence errors by keeping only sequences that appear in more than one PCR above a minimum copy number per PCR. The filtering thresholds are user‐configurable. We report that OTU dropout frequency and taxonomic amplification bias are both reduced by using a PCR annealing temperature and cycle number on the low ends of the ranges currently used for the Leray‐FolDegenRev primers. We also report that tag jumps and erroneous sequences can be nearly eliminated with Begum filtering, at the cost of only a small rise in dropouts.
We replicate published findings that uneven size distribution of input biomasses leads to greater dropout frequency and that OTU size is a poor predictor of species input biomass. Finally, we find no evidence for ‘tag‐biased’ PCR amplification. To aid learning, reproducibility, and the design and testing of alternative metabarcoding pipelines, we provide our Illumina and input‐species sequence datasets, scripts, a spreadsheet for designing primer tags and a tutorial.

Highlights

We report that operational taxonomic unit (OTU) dropout frequency and taxonomic amplification bias are both reduced by using a PCR annealing temperature and cycle number at the low ends of the ranges currently used for the Leray-FolDegenRev primers.
We show that, with Begum filtering, metabarcoding efficiency is highest with a PCR cycle number and annealing temperature at the low ends of the ranges currently used in metabarcoding studies; that Begum filtering nearly eliminates false-positive OTUs, at the cost of only a small absolute rise in false-negative frequency; that greater species evenness and higher concentrations reduce dropouts; and that OTU sizes are not reliable estimators of species relative abundances.
We tested our pipeline with eight mock soups that differed in their absolute and relative DNA concentrations of 248 arthropod taxa (Table 2, Figure 2).

Summary

Methods

As part of the pipeline, we introduce a new version of the DAMe software package (Zepeda-Mendoza et al., 2016), renamed Begum (Hindi for ‘lady’), to demultiplex samples, remove tag-jumped sequences and filter out erroneous sequences (Alberdi et al., 2018). Regarding the latter, the DAMe/Begum logic is that true sequences are more likely than erroneous sequences (indels, substitutions, chimeras) to appear in multiple, independent PCR replicates and in multiple copies. We note that this logic is less applicable to species represented by trace DNA, such as in water samples, where low concentrations of DNA template are more likely to result in a species truly appearing in only one PCR (Harper et al., 2018; Piaggio et al., 2014). Begum improves on DAMe by ignoring heterogeneity spacers in the amplicon, by allowing primer mismatches during demultiplexing, and by being more efficient. Not included among the errors considered here are software bugs, general laboratory and field errors like mislabelling, sampling biases or inadequate sequencing depth.
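The replicate-based filtering logic can be illustrated with a minimal sketch. This is not Begum's code or interface: the function name `filter_sequences`, the parameters `min_pcrs` and `min_copies`, the input layout and the toy read counts are all hypothetical, and the sketch assumes that a sequence counts as present in a PCR when its copy number is at least `min_copies`.

```python
from collections import defaultdict


def filter_sequences(pcr_counts, min_pcrs=2, min_copies=2):
    """Return the set of sequences retained by a replicate-based filter.

    pcr_counts: dict mapping a PCR replicate name to a dict of
                sequence -> read count (hypothetical input layout).
    min_pcrs:   minimum number of PCR replicates in which a sequence
                must be present (assumed parameter name).
    min_copies: minimum copy number within a PCR for the sequence to
                count as present in that PCR (assumed parameter name).
    """
    presence = defaultdict(int)
    for counts in pcr_counts.values():
        for seq, n_reads in counts.items():
            if n_reads >= min_copies:
                presence[seq] += 1
    return {seq for seq, n_pcrs in presence.items() if n_pcrs >= min_pcrs}


# Toy data: three PCR replicates of one sample (invented counts).
replicates = {
    "pcr_A": {"ACGT": 40, "ACGA": 1, "TTGC": 5},
    "pcr_B": {"ACGT": 35, "TTGC": 1},
    "pcr_C": {"ACGT": 50, "TTGC": 7, "GGGG": 3},
}

kept = filter_sequences(replicates, min_pcrs=2, min_copies=2)
strict = filter_sequences(replicates, min_pcrs=3, min_copies=2)
```

With these toy counts, `kept` retains `ACGT` and `TTGC`, while the singleton `ACGA` and the single-replicate `GGGG` are discarded. Raising `min_pcrs` to 3 also drops `TTGC`, which shows how stricter thresholds trade fewer false positives for more dropouts.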


Discussion

Future work