The bulk of eukaryotic diversity is microbial, with macroscopic lineages such as plants, animals, and fungi nested among a plethora of diverse lineages that include amoebae, flagellates, ciliates, and many types of algae. Our understanding of the evolutionary relationships and genome properties of microbial eukaryotes is rapidly advancing through analyses of omics (transcriptomic, genomic) data. However, phylogenomic analyses are challenging for microeukaryotes, particularly uncultivable lineages, because single-cell approaches generate a mixture of sequence data from hosts, associated microbiomes, and contaminants. Current practices include resampling of hand-curated gene sets, which can be difficult for other researchers to replicate. To address these challenges, we present PhyloToL version 6.0, a modular, user-friendly pipeline for effective data curation, including phylogeny-informed contamination removal, estimation of homologous gene families (GFs), and generation of both multisequence alignments (MSAs) and gene trees. We provide several databases that will be of use to those interested in eukaryotic evolution: a Hook Database of curated reference sequences for 15,000 GFs; a database of transcriptomes and genomes from 1,000 taxa with GFs assigned; and a highly curated set of MSAs and gene trees for 500 GFs in these taxa. We also demonstrate a suite of stand-alone utilities that provide basic statistics on sequences, analyze compositional and codon patterns, and enable exploration of trees (e.g., clade-grabbing and efficient tip labeling). We exemplify the power of PhyloToL 6.0 by estimating eukaryotic phylogeny using the 500 conserved GFs, and we set standards for the curation of omics data for future research in the field.