A protocol for species delineation of public DNA databases, applied to the Insecta.

Douglas Chesters, Chao-Dong Zhu

doi:10.1093/sysbio/syu038

Abstract

Public DNA databases are composed of data from many different taxa, although the taxonomic annotation on sequences is not always complete, which impedes the utilization of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself, although applying species clustering to whole databases requires consolidation of results from numerous undefined gene regions, and introduces significant obstacles in data organization and computational load. In the current paper, we demonstrate an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After filtration of duplicated data, delineation of the database into species or molecular operational taxonomic units (MOTUs) followed a three-step process in which (i) the genetic loci L are partitioned, (ii) the species S are delineated within each locus, then (iii) species units are matched across loci to form the matrix L × S, a set of global (multilocus) species units. Partitioning the database into a set of homologous gene fragments was achieved by Markov clustering using edge weights calculated from the amount of overlap between pairs of sequences, then delineation of species units and assignment of species names were performed for the set of genes necessary to capture most of the species diversity. The complexity of computing pairwise similarities for species clustering was substantial at the cytochrome oxidase subunit I locus in particular, but made feasible through the development of software that performs pairwise alignments within the taxonomic framework, while accounting for the different ranks at which sequences are labeled with taxonomic information. 
Over 24 different homologs, the unidentified sequences numbered approximately 194,000, containing 41,525 species IDs (98.7% of all found in the insect database), and were grouped into 59,173 single-locus MOTUs by hierarchical clustering under parameters optimized independently for each locus. Species units from different loci were matched using a multipartite matching algorithm to form multilocus species units with minimal incongruence between loci. After matching, the insect database as represented by these 24 loci was found to be composed of 78,091 species units in total. Of these, 38,574 units contained only species-labeled data and 34,891 contained only unlabeled data, leaving 4,626 units composed of both labeled and unlabeled sequences. In addition to giving estimates of species diversity of sequence repositories, the protocol developed here will facilitate species-level applications of modern-day sequence data sets. In particular, the L × S matrix represents a post-taxonomic framework that can be used for species-level organization of metagenomic data, and incorporation of these methods into phylogenetic pipelines will yield matrices more representative of species diversity.
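
As a rough illustration of steps (ii) and (iii), the sketch below clusters each locus separately and then links the resulting units across loci into an L × S structure. It is a toy under stated assumptions, not the software described in the abstract: sequences are assumed to be pre-aligned and of equal length within a locus, distance is a plain proportion of mismatched sites, clustering is single-linkage at a fixed threshold (one simple form of hierarchical clustering), and cross-locus matching is replaced by a much simpler rule that joins units sharing a sequence-source identifier, rather than the multipartite matching algorithm used in the paper. All names (`delineate_locus`, `match_across_loci`, the `s1`... identifiers, the 0.15 threshold) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Minimal disjoint-set structure used for both clustering steps."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def p_distance(a, b):
    """Proportion of mismatched sites (assumes aligned, equal-length input)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def delineate_locus(seqs, threshold):
    """Step (ii), simplified: single-linkage clustering within one locus.

    seqs maps a hypothetical source identifier to an aligned sequence.
    Returns a list of sets of identifiers, one set per MOTU.
    """
    uf = UnionFind()
    for ident in seqs:
        uf.find(ident)  # register singletons
    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            uf.union(a, b)
    groups = defaultdict(set)
    for ident in seqs:
        groups[uf.find(ident)].add(ident)
    return list(groups.values())


def match_across_loci(motus_by_locus):
    """Step (iii), simplified: join MOTUs from different loci that share an
    identifier. This stands in for, and is NOT, multipartite matching.

    Returns a list of dicts, each mapping locus name to the set of
    identifiers that the multilocus unit holds at that locus.
    """
    uf = UnionFind()
    first_seen = {}  # identifier -> first (locus, index) key containing it
    for locus, motus in motus_by_locus.items():
        for i, members in enumerate(motus):
            key = (locus, i)
            uf.find(key)
            for ident in members:
                if ident in first_seen:
                    uf.union(first_seen[ident], key)
                else:
                    first_seen[ident] = key
    units = defaultdict(dict)
    for locus, motus in motus_by_locus.items():
        for i, members in enumerate(motus):
            root = uf.find((locus, i))
            units[root].setdefault(locus, set()).update(members)
    return list(units.values())


# Toy data: two loci, five hypothetical sequence sources.
loci = {
    "COI": {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "TTGTACCTAC"},
    "16S": {"s2": "GGCCAATT", "s4": "GGCCAATT", "s5": "AACCTTTT"},
}
motus_by_locus = {name: delineate_locus(seqs, 0.15) for name, seqs in loci.items()}
units = match_across_loci(motus_by_locus)
```

In this toy input, `s2` is the only identifier present at both loci, so it is what ties a COI unit to a 16S unit; units with no such link remain single-locus rows of the matrix. A fixed threshold is used here only for brevity, whereas the abstract states that clustering parameters were optimized independently for each locus.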

Journal: Systematic Biology
Publication Date: Jun 14, 2014
Citations: 53
