Abstract
Phylogenetics is an important area of evolutionary biology that helps us understand the origin and divergence of genes, genomes and species. Accurate reconstruction of the past depends on meaningful phylogenetic trees, and building such trees for genes or proteins requires reliable and robust methods. With the rapidly increasing availability of genome and transcriptome sequencing data, there is a need for efficient and accurate methodologies for ancestral state reconstruction. Currently available methods are mostly specific to certain gene families and require substantial adaptation before they can be applied to other gene families. Hence, a generalized framework is essential to utilize large transcriptome resources such as OneKP and MMETSP. Here, we have developed a flexible yet efficient method built on core strengths such as inclusive homolog selection and ortholog definition based on multi-layered inferences. We illustrate how specific steps can be modified to fit the needs of any protein family under consideration. We also demonstrate the success of this protocol by studying and testing the orthologs in various gene families. Taken together, we present a protocol for reconstructing the ancestral states of various domains and proteins across multiple kingdoms of eukaryotes, using thousands of transcriptomes.
Highlights
OneKP represents the majority of land plant and algal groups, whereas the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) covers the majority of the SAR group and other phyla in Chromista
We developed a unified framework to build high-resolution phylogenies that utilize the rich OneKP and MMETSP transcriptome resources
This protocol is built on three core strengths: (1) Inclusive: Include more sequences at the start with liberal parameters, and remove sequences as one goes through various steps in the pipeline, resulting in a high-quality logical sequence set for phylogenetic tree construction
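The "include liberally, then prune" idea in the highlight above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Candidate` fields, the function names and every threshold are hypothetical and are not taken from the protocol.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    seq_id: str
    evalue: float  # hypothetical similarity-search score (lower is better)
    length: int    # sequence length in residues


def select_inclusive(candidates, liberal_evalue=1e-1):
    """Start with a liberal cut-off so that distant homologs are retained."""
    return [c for c in candidates if c.evalue <= liberal_evalue]


def prune(candidates, filters):
    """Apply successive filters and record how many sequences survive each."""
    kept = list(candidates)
    history = [len(kept)]
    for keep in filters:
        kept = [c for c in kept if keep(c)]
        history.append(len(kept))
    return kept, history


# Hypothetical candidate sequences
candidates = [
    Candidate("s1", 1e-30, 400),
    Candidate("s2", 5e-2, 90),
    Candidate("s3", 1e-5, 350),
    Candidate("s4", 2.0, 500),
]

pool = select_inclusive(candidates)
final, history = prune(
    pool,
    [
        lambda c: c.length >= 100,   # assumed minimum length
        lambda c: c.evalue <= 1e-3,  # assumed stricter score cut-off
    ],
)
```

Recording the number of sequences that survive each step (`history`) makes it easy to see which filter removes the most candidates before the final set is passed to tree construction.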
Summary
Most of the programs listed in the Software section run only in a Linux environment; it is recommended to perform the analysis on a Linux machine with access to the BASH shell (terminal).
RAxML v8 (Stamatakis, 2014) (https://cme.h-its.org/exelixis/web/software/raxml/index.html)
Linux BASH shell (terminal) commands ‘cut’, ‘sort’ and ‘uniq’ (https://tiswww.case.edu/php/chet/bash/bashref.html)
OneKP dataset (1000 plant transcriptomes project): Contains 1341 transcriptomes from 1179 species covering all the major classes of land plants, green algae, red algae and glaucophytes (Carpenter et al, 2019; One Thousand Plant Transcriptomes Initiative, 2019); http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019
MMETSP dataset (Marine Microbial Eukaryote Transcriptome Sequencing Project): Contains 678 transcriptomes from 410 species covering all the major classes of Stramenopila and Alveolata (SAR group) and many unclassified (unicellular) marine eukaryotes (Keeling et al, 2014); https://gold.jgi.doe.gov/study?id=Gs0128947
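The Summary names the shell commands cut, sort and uniq without showing how they are combined. As a rough illustration, a pipeline such as `cut -f1 | sort | uniq -c` can be approximated in Python as below. The tab-separated table, its column layout and the function name are assumptions made for this sketch and are not part of the protocol.

```python
from collections import Counter


def count_column_values(lines, column=0, sep="\t"):
    """Rough equivalent of `cut -f<column+1> | sort | uniq -c`.

    Returns (value, count) pairs sorted by value. Lines that lack the
    requested column are skipped.
    """
    values = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(sep)
        if column < len(fields):
            values.append(fields[column])
    return sorted(Counter(values).items())


# Hypothetical table: sequence ID <tab> sample code
example = [
    "seq_A\tsample1",
    "seq_B\tsample1",
    "seq_A\tsample2",
]

id_counts = count_column_values(example, column=0)
sample_counts = count_column_values(example, column=1)
```

The same function works on an open file handle, since iterating over a file yields one line at a time.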