Abstract

Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches rely either on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets.

Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales to large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing species delimitation in high-throughput sequencing data.

Availability and implementation: The code is freely available at www.exelixis-lab.org/software.html.

Contact: Alexandros.Stamatakis@h-its.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • DNA barcoding studies mostly rely on a single marker gene and are widely used for DNA taxonomy (Goldstein and DeSalle, 2011; Vogler and Monaghan, 2007)

  • We describe the open reference species delimitation pipeline that combines the evolutionary placement algorithm (EPA) with the Poisson tree processes (PTP) (EPA-PTP)

  • Because the PTP method requires a correctly rooted tree, we use the following two rooting strategies: if the branch leads to a tip, apart from the query sequences, we extend the alignment by including the reference tree tip sequence and the reference sequence that is furthest away from the current tip
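The tip case of the rooting strategy above needs the reference tip that is furthest from the current tip. The following is a minimal sketch of that lookup, not the EPA-PTP implementation. It assumes that "furthest away" means the largest patristic distance (the sum of branch lengths along the connecting path). The adjacency-dict tree layout, the function names `patristic_distances` and `furthest_tip`, and the toy tree are all hypothetical and chosen for illustration only.

```python
def patristic_distances(tree, start):
    """Return the summed branch length from `start` to every node.

    `tree` is a hypothetical adjacency dict for an unrooted tree:
    {node: {neighbor: branch_length, ...}, ...}
    """
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in tree[node].items():
            if neighbor not in dist:  # a tree has exactly one path to each node
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    return dist


def furthest_tip(tree, tip):
    """Return the tip with the largest patristic distance from `tip`."""
    dist = patristic_distances(tree, tip)
    # Tips are nodes with exactly one neighbor; exclude the starting tip.
    tips = [n for n in tree if len(tree[n]) == 1 and n != tip]
    return max(tips, key=lambda n: dist[n])


# Toy reference tree (assumed data): tips A-D, inner nodes n1 and n2.
toy_tree = {
    "A": {"n1": 0.1},
    "B": {"n1": 0.2},
    "C": {"n2": 0.3},
    "D": {"n2": 0.6},
    "n1": {"A": 0.1, "B": 0.2, "n2": 0.05},
    "n2": {"n1": 0.05, "C": 0.3, "D": 0.6},
}

if __name__ == "__main__":
    print(furthest_tip(toy_tree, "A"))
```

In this toy tree the tip selected for `A` is the one reached through the longest summed path, so it would be the second reference sequence added to the alignment under the assumption stated above.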


Summary

Introduction

DNA barcoding studies mostly rely on a single marker gene and are widely used for DNA taxonomy (Goldstein and DeSalle, 2011; Vogler and Monaghan, 2007). Numerous approaches exist for associating anonymous reads/query sequences with known species, for instance, nearest-neighbor BLAST (Liu et al., 2008) or the naïve Bayesian classifier (Wang et al., 2007). These methods use sequence similarity to associate reads with taxonomic ranks. Placement methods are similar to closed-reference OTU-picking (Bik et al., 2012) or taxonomy-dependent methods (Schloss and Westcott, 2011). Their ability to associate query sequences with species depends on the completeness of the taxon sampling in the reference data (Meyer and Paulay, 2005). Closed-reference or taxonomy-dependent methods generally lack the ability to delimit new species; they may underestimate the number of species and the diversity in the query sequences (see example in Supplementary Fig. S1).

