Abstract

Summary: Genome-level evolutionary inference (i.e. phylogenomics) is becoming an increasingly essential step in many biologists’ work. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required (such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering, and stitching different tools together) can be prohibitive. Here I introduce GToTree, a command-line tool that takes any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set. Although GToTree can work with any custom hidden Markov models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ∼12 000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees.

Availability and implementation: GToTree is open-source and freely available for download from: github.com/AstrobioMike/GToTree. It is implemented primarily in bash with helper scripts written in Python.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • The number of sequenced genomes is increasing rapidly, largely through the recovery of metagenome-assembled genomes (MAGs) (e.g. Hug et al 2016; Parks et al 2017) and through the generation of single-cell amplified genomes (SAGs) (e.g. Kashtan et al 2014; Berube et al 2018)

  • Large-scale comparative genomics efforts leveraging growing public databases can be employed to investigate evolutionary avenues such as ancestral reconstruction (Braakman, Follows, and Chisholm 2017), which are guided by phylogenomics

  • There are several tools available for the major steps in a phylogenomics workflow, and at least one analysis platform that incorporates a phylogenomics workflow amid a larger infrastructure

Introduction

The number of sequenced genomes is increasing rapidly, largely through the recovery of metagenome-assembled genomes (MAGs) (e.g. Hug et al 2016; Parks et al 2017) and through the generation of single-cell amplified genomes (SAGs) (e.g. Kashtan et al 2014; Berube et al 2018).

GToTree fills a void on three primary fronts: 1) it accepts as input any combination of fasta files, GenBank files, and/or NCBI accessions, allowing integration of genomes from various sources and stages of analysis without any computational burden to the user; 2) it automates required between-tool tasks such as filtering out hits by gene length, filtering out genomes with too few hits to the specified target genes, and swapping genome identifiers so resulting trees and alignments can be explored more easily; and 3) it is scalable: GToTree can turn ~1,700 input genomes into a tree in one hour on a standard laptop, and can optionally run many steps in parallel.

The required inputs to GToTree are 1) any combination of fasta files, GenBank files, and/or NCBI assembly accessions, and 2) an HMM file with the target genes. The user can also provide a mapping file that pairs specific input genome IDs with the labels they would like displayed in the final alignment and tree.
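The between-tool tasks described above (filtering hits by gene length, dropping genomes with too few hits, and swapping identifiers for display labels) can be illustrated with a minimal Python sketch. This is not GToTree's own code: the function names, the data layout, and the thresholds (a 20% length tolerance around the median and a 50% minimum fraction of target genes) are all assumptions chosen for illustration.

```python
from statistics import median


def filter_hits_by_length(hit_lengths, length_tolerance=0.2):
    """Keep hits to one target gene whose length is close to the median.

    hit_lengths maps genome ID -> length of that genome's hit.
    length_tolerance is a hypothetical cutoff, not a documented default.
    """
    if not hit_lengths:
        return {}
    med = median(hit_lengths.values())
    low, high = med * (1 - length_tolerance), med * (1 + length_tolerance)
    return {g: n for g, n in hit_lengths.items() if low <= n <= high}


def filter_genomes_by_hits(hits_per_genome, n_target_genes, min_fraction=0.5):
    """Keep genomes that hit at least min_fraction of the target genes.

    hits_per_genome maps genome ID -> set of target genes found.
    min_fraction is a hypothetical cutoff, not a documented default.
    """
    return [
        genome
        for genome, genes in hits_per_genome.items()
        if len(genes) / n_target_genes >= min_fraction
    ]


def relabel(genome_ids, mapping):
    """Swap genome IDs for display labels; IDs without a label are kept."""
    return [mapping.get(genome, genome) for genome in genome_ids]


# Illustrative, made-up data.
lengths = {"genomeA": 300, "genomeB": 310, "genomeC": 120}
hits = {"genomeA": {"gene1", "gene2", "gene3"}, "genomeB": {"gene1"}}
labels = {"genomeA": "Example label A"}

kept_hits = filter_hits_by_length(lengths)
kept_genomes = filter_genomes_by_hits(hits, n_target_genes=4)
print(kept_hits)
print(relabel(kept_genomes, labels))
```

In this sketch the length filter is applied per target gene and the genome filter afterwards, so a genome whose hits are mostly truncated would lose those hits before its coverage is counted. That ordering is a design choice of the sketch, not a statement about GToTree's behaviour.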
