Abstract
Biological conclusions based on DNA barcoding and metabarcoding analyses can be strongly influenced by the methods used for data generation and curation, leading to varying levels of success in separating biological variation from experimental error. The 5' region of cytochrome c oxidase subunit I (COI-5P) is the most common barcode gene for animals, with conserved structure and function that allows for biologically informed error identification. Here, we present coil (https://CRAN.R-project.org/package=coil), an R package for the pre-processing and frameshift error assessment of COI-5P animal barcode and metabarcode sequence data. The package contains functions for placing barcodes into a common reading frame, accurately translating sequences to amino acids, and highlighting insertion and deletion errors. The analysis of 10 000 barcode sequences of varying quality demonstrated that coil can place barcode sequences in reading frame and distinguish sequences containing indel errors from error-free sequences with greater than 97.5% accuracy. Package limitations were tested through the analysis of COI-5P sequences from the plant and fungal kingdoms as well as the analysis of potential contaminants: nuclear mitochondrial pseudogenes and Wolbachia COI-5P sequences. Results demonstrated that coil is a strong technical error identification method but is not reliable for detecting all biological contaminants.
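The link between reading frame, translation and indel errors can be illustrated with a small sketch. It is not coil's method and does not use coil's API: it is a simplified stand-in, written in Python, that translates a sequence with the invertebrate mitochondrial genetic code (NCBI translation table 5, chosen here as an assumption) and looks for stop codons, which a single-base insertion or deletion often introduces downstream of the error. The function names `translate`, `stop_free_frames` and `likely_frameshift` are hypothetical.

```python
from itertools import product

# NCBI translation table 5 (invertebrate mitochondrial), an assumed choice.
# Codons are ordered TTT, TTC, TTA, TTG, TCT, ... to match the amino acid string.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, offset=0):
    """Translate complete codons starting at `offset`; unknown codons become 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stop_free_frames(seq):
    """Return the frame offsets (0, 1, 2) whose translation has no stop codon."""
    return [offset for offset in range(3) if "*" not in translate(seq, offset)]


def likely_frameshift(seq, offset=0):
    """Flag a sequence whose translation in the expected frame contains a stop."""
    return "*" in translate(seq, offset)


if __name__ == "__main__":
    intact = "ACTTTAGGAATA"   # toy fragment, not a real COI-5P sequence
    deleted = intact[1:]      # same fragment with its first base removed
    print(translate(intact), likely_frameshift(intact))
    print(translate(deleted), likely_frameshift(deleted))
    print(stop_free_frames(intact))
```

In this toy example more than one frame of the intact fragment is free of stop codons, so a stop-codon scan alone cannot always pick the correct frame. Comparing each sequence against a COI-5P profile, as the package is described as doing, avoids that ambiguity.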
Highlights
Answering questions about biodiversity through DNA barcode analyses depends on the comparison of novel barcode sequences to reference libraries or the de novo comparison of sequences to one another (Hebert et al. 2004; Ratnasingham and Hebert 2007; Hubert and Hanner 2015; Elbrecht et al. 2018).
An approximately 657 bp fragment of the 5' region of the cytochrome c oxidase subunit I gene (COI-5P) is the main marker used in DNA barcoding of the animal kingdom (Hebert et al. 2003).
We demonstrate the effectiveness of coil by showing how it can be used to align novel barcode sequences to the COI-5P profile and to identify sequences with insertion or deletion errors, in most cases with greater than 97.5% accuracy.
Summary
DNA barcoding leverages sequence diversity within standardized gene regions for the identification and classification of organisms (Hebert et al. 2003; Ratnasingham and Hebert 2007). Answering questions about biodiversity through DNA barcode analyses depends on the comparison of novel barcode sequences to reference libraries or the de novo comparison of sequences to one another (Hebert et al. 2004; Ratnasingham and Hebert 2007; Hubert and Hanner 2015; Elbrecht et al. 2018). Techniques such as DNA metabarcoding greatly expand the complexity of comparative analyses due to the increased scale and associated challenges such as additional noise in datasets (Cristescu 2014). Barcode and metabarcode output sequences can vary in both length and accuracy due to a mixture of true biological variation (Pentinsaari et al. 2016), the primers or sequencing platforms used (Folmer et al. 1994; Hebert et al. 2018), and the data cleaning steps employed (Elbrecht et al. 2018). An approximately 657 bp fragment of the 5' region of the cytochrome c oxidase subunit I gene (COI-5P) is the main marker used in DNA barcoding of the animal kingdom (Hebert et al. 2003).