Abstract

Motivation: DNA metabarcoding is commonly used to infer the species composition of environmental samples: a short, homologous DNA sequence is amplified and sequenced from all members of the community. Samples can comprise hundreds of organisms that may be closely or very distantly related. The method combines polymerase chain reaction (PCR) with next-generation sequencing (NGS), and sequences are taxonomically identified by their match to a reference database. Ideally, each species of interest would have a unique DNA barcode. This short, variable sequence needs to be flanked by conserved regions that can serve as primer-binding sites, so that a single PCR primer pair amplifies a variable barcode across a broad evolutionary range of taxa. To date, no tools exist that computationally search for new primer pairs and analyze their effectiveness on large, unaligned sequence data sets. More specifically, we solve the following problem: given a set of reference sequences R = {R1, R2, ..., Rm}, find a primer set P that allows for high taxonomic coverage. This goal can be achieved by filtering for frequent primers and ranking them by coverage or by variation, i.e. the number of unique barcodes, for further analysis. Here we present PriSeT, an offline primer-discovery tool that is capable of processing large libraries and is robust against mislabeled or low-quality references. It avoids constructing a multiple sequence alignment of R. Instead, PriSeT uses encodings of frequent k-mers that allow bit-parallel processing and other optimizations.

Results: We first evaluated PriSeT on references (mostly 18S rRNA genes) from 19 clades covering eukaryotic organisms that are typical of freshwater plankton samples. PriSeT recovered several published primer sets as well as additional, more chemically suitable primer sets. For these new sets, we compared frequency, taxonomic coverage, and amplicon variation with those of published primer sets. For 11 clades we found de novo primer pairs that cover more taxa than the published ones, and for six clades de novo primers resulted in greater sequence (i.e., DNA barcode) variation. We also applied PriSeT to SARS-CoV-2 genomes and computed 114 new primer pairs under the additional constraint that the sequences do not co-occur in closely related taxa. These primer sets would be suitable for empirical testing.

Availability: https://github.com/mariehoffmann/PriSeT

Contact: marie.hoffmann@fu-berlin.de
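
To illustrate the idea of encoding k-mers as integers and filtering for frequent ones without an alignment, here is a minimal Python sketch. It is not PriSeT's implementation, which the abstract does not describe in detail. The function names (`encode_kmers`, `decode_kmer`, `frequent_kmers`), the 2-bit nucleotide mapping, and the `min_fraction` threshold are all assumptions made for this example.

```python
from collections import Counter

# Assumed 2-bit code per nucleotide; ambiguous bases such as N have no code.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k):
    """Yield (start position, integer code) for each k-mer without ambiguous bases."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    value, valid = 0, 0        # rolling code and count of consecutive valid bases
    for i, base in enumerate(seq.upper()):
        code = CODE.get(base)
        if code is None:
            # An ambiguous base invalidates every k-mer that overlaps it.
            value, valid = 0, 0
            continue
        # Shift in the new base and drop the oldest one via the mask.
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, value


def decode_kmer(value, k):
    """Turn an integer code back into its nucleotide string."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - j))) & 3] for j in range(k))


def frequent_kmers(references, k, min_fraction):
    """Map each k-mer found in at least min_fraction of the references
    to the number of references that contain it."""
    counts = Counter()
    for seq in references:
        # A set ensures each reference contributes at most once per k-mer.
        counts.update({code for _, code in encode_kmers(seq, k)})
    threshold = min_fraction * len(references)
    return {decode_kmer(c, k): n for c, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    refs = ["ACGTACGT", "TTACGTTT", "GGGGACGA"]  # toy sequences, not real data
    print(frequent_kmers(refs, k=4, min_fraction=0.6))
```

Counting per reference rather than per occurrence matches the notion of a primer candidate being frequent across the reference set. A real primer search would additionally need chemical constraints (for example melting temperature) and pairing of forward and reverse candidates, which this sketch leaves out.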
