Abstract

Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) that take advantage of machine learning and bioinformatics to address this issue for species assignment with both coding and non-coding sequences. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed the existing neighbor-joining (NJ) and maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also outperformed the NJ and ML methods for non-coding sequences under potentially incomplete species coverage, although in that setting the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75–100%. The new methods also obtained a 96.29% success rate (95% CI: 91.62–98.40%) for 484 rust fungi queries and a 98.50% success rate (95% CI: 96.60–99.37%) for 1,094 brown algae queries, both using ITS barcodes.

Highlights

  • Neotropical bat and marine fish cytochrome c oxidase subunit I (COI) datasets: in total, 8,120 random queries from 766 bat COI sequences were examined with two traditional methods (neighbor-joining (NJ) and maximum likelihood (ML)) and the two newly proposed radial basis function (RBF) methods (DV-RBF and FJ-RBF) against corresponding reference libraries. Of these, 5,180 queries were carried out using 5 repeated random splits, representing complete/balanced species coverage in the reference library

  • In the case of balanced species coverage, both the DV-Curve-based RBF (DV-RBF) and FJ-RBF methods achieved 100% success rates with 1,295 random queries each (Figure 1a), while the NJ and ML methods obtained success rates of 95.75% and 87.25%, respectively

  • With 766 random queries for each of DV-RBF and FJ-RBF, the NJ method outperformed all other methods (94.86%, 95% confidence interval (CI): 92.93–96.28%), compared with 88.97% for ML, 86.18% for DV-RBF and 81.54% for FJ-RBF (Figure 1a)
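The success rates above are reported with 95% confidence intervals. As a rough illustration of how such an interval can be obtained from a count of correct identifications, the sketch below uses the Wilson score interval for a binomial proportion. This is an assumption made for illustration only: the text here does not say which interval method the authors used, so the output need not match the reported intervals. The function name `wilson_interval` is hypothetical, and the example counts (1,295 correct out of 1,295 queries) are taken from the balanced-coverage result quoted above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: this is one common interval choice, not necessarily
    the method used in the paper. z=1.96 corresponds to ~95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Balanced-coverage example from the highlights: 1,295 of 1,295 correct.
low, high = wilson_interval(1295, 1295)
print(f"95% Wilson interval: {100 * low:.2f}% to {100 * high:.2f}%")
```

Even with every query identified correctly, the lower bound stays below 100% because the number of queries is finite; larger query sets tighten the interval.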

Introduction

Although DNA barcoding has become increasingly popular as a tool for species discrimination and identification [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], some aspects remain controversial [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. A fundamental issue in DNA barcoding is how best to assign a query sequence from an unknown specimen to the correct species in the reference sequence database [15,19,24,25,35,36,37,38,39,40,41,42,43]. If the query falls into a polyphyletic or paraphyletic clade, assignment to the correct species becomes ambiguous.
