SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets

Daniel M Portik, John J Wiens

doi:10.1111/2041-210x.13392

Abstract

Phylogenies with extensive taxon sampling have become indispensable for many types of ecological and evolutionary studies. Many large‐scale trees are based on a ‘supermatrix’ approach, which involves amalgamating thousands of published sequences for a group. Constructing up‐to‐date supermatrices can be challenging, especially as new sequences may become available almost constantly. Additionally, genomic datasets (composed of thousands of loci) are becoming common in phylogenetics and phylogeography, and present novel challenges for constructing such datasets. Here we present SuperCRUNCH, a Python toolkit for assembling large phylogenetic datasets. It can be applied to GenBank sequences, unpublished sequences or combinations of GenBank and unpublished data. SuperCRUNCH constructs local databases and uses them to conduct rapid searches for user‐specified sets of taxa and loci. Sequences are parsed into putative loci and passed through rigorous filtering steps. A post‐filtering step allows for selection of one sequence per taxon (i.e. species‐level supermatrix) or retention of all sequences per taxon (i.e. population‐level dataset). Importantly, SuperCRUNCH can generate ‘vouchered’ population‐level datasets, in which voucher information is used to generate multi‐locus phylogeographic datasets. SuperCRUNCH offers many options for taxonomy resolution, similarity filtering, sequence selection, alignment and file manipulation. We demonstrate the range of features available in SuperCRUNCH by generating a variety of phylogenetic datasets. Output datasets include traditional species‐level supermatrices, large‐scale phylogenomic matrices and phylogeographic datasets. Finally, we briefly compare the ability of SuperCRUNCH to construct species‐level supermatrices relative to alternative approaches.
SuperCRUNCH generated a large‐scale supermatrix (1,400 taxa and 66 loci) from 16 GB of GenBank data in ~1.5 hr, and generated population‐level datasets (<350 samples, <10 loci) in <1 min. It outperformed alternative methods for supermatrix construction in terms of taxa, loci and sequences recovered. SuperCRUNCH is a modular bioinformatics toolkit that can be used to assemble datasets for any taxonomic group and scale (kingdoms to individuals). It allows rapid construction of supermatrices, greatly simplifying the process of updating large phylogenies with new data. It is also designed to produce population‐level datasets. SuperCRUNCH streamlines the major tasks required to process phylogenetic data, including filtering, alignment, trimming and formatting. SuperCRUNCH is open‐source, documented and available at https://github.com/dportik/SuperCRUNCH.
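
The post‐filtering choice described above (one sequence per taxon versus all sequences per taxon) can be illustrated with a minimal Python sketch. This is not SuperCRUNCH code and does not use its interface: the function names, the header layout (`ACCESSION Genus species description`), the toy records and the "keep the longest sequence" rule are all assumptions made for illustration, since the abstract does not state how the representative sequence is chosen.

```python
from collections import defaultdict

# Toy FASTA text; accessions and sequences are made up for illustration.
EXAMPLE = """>AB000001 Rana temporaria cytochrome b
ATGACCAAC
>AB000002 Rana temporaria cytochrome b
ATGACCAACATTCGA
>AB000003 Bufo bufo cytochrome b
ATGACA
"""


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def taxon_of(header):
    """Assumed header layout: 'ACCESSION Genus species description...'."""
    fields = header.split()
    return " ".join(fields[1:3])


def select_per_taxon(records, one_per_taxon=True):
    """Group records by taxon.

    one_per_taxon=True keeps a single record per taxon (species-level);
    the longest sequence is kept, which is an assumed criterion.
    one_per_taxon=False keeps every record (population-level).
    """
    by_taxon = defaultdict(list)
    for header, seq in records:
        by_taxon[taxon_of(header)].append((header, seq))
    if not one_per_taxon:
        return dict(by_taxon)
    return {
        taxon: [max(recs, key=lambda rec: len(rec[1]))]
        for taxon, recs in by_taxon.items()
    }


species_level = select_per_taxon(parse_fasta(EXAMPLE))
population_level = select_per_taxon(parse_fasta(EXAMPLE), one_per_taxon=False)
```

Here `species_level` holds one record for each of the two taxa, while `population_level` retains both *Rana temporaria* records. A real workflow would apply this per locus and after similarity filtering; see the SuperCRUNCH documentation for the toolkit's actual modules and options.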

Journal: Methods in Ecology and Evolution
Publication Date: Apr 11, 2020
Citations: 17
