Abstract

Background: Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies are contradictory on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data.

Results: Using Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacteria and plastid genomes because of heterogeneities both in substitution processes across sites and in compositions across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies (“phyloclassifies”) cyanobacterial genomes to their clade of origin based on bioinformatically predicted function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data.

Conclusions: Phylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities. However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias, with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies.

Highlights

  • Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles

  • We show that our main result, the late-branching cyanobacterial phyloclassification of plastids, is robust to the deletion of clades included in the phyloclassifier model, to unbalanced sampling of genomes across clades, and to compositional heterogeneity in input transfer ribonucleic acid (tRNA) gene data

  • We found that the empirical matrix model with site-rate heterogeneity LG+4 [36], which was applied to cyanobacterial and plastid data in Shih et al. [10], Ponce-Toledo et al. [12], and Ochoa de Alda et al. [14], poorly fits site-process heterogeneity in all three phylogenomic datasets (Fig. 1A and Additional file 1: Table S1)

Summary

Introduction

Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies are contradictory on whether plastids branch early or late within Cyanobacteria. Evidence from signature-based approaches that rely on binary characters, such as the presence or absence of endosymbiotic gene transfers [27], eukaryotic glycogen and starch pathways [28, 29], and conserved indels [30], points more consistently toward a late-branching origin of plastids. We first show that recently published phylogenomic datasets, assembled from cyanobacterial and plastid genomes to address the root of plastids, are poorly fit by the evolutionary models and character recoding strategies applied to them, which may help explain why earlier studies have reached contradictory conclusions with strong support. To "phyloclassify" a query genome to its clade of origin, CYANO-MLP takes an input vector that scores the query tRNA gene complement against each of eight sub-clade-specific structure-function maps for tRNAs, called function logos, and the Class-Informative Features (CIFs) they contain [31].
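
The shape of that input can be illustrated with a minimal sketch. It assumes a toy representation that is not taken from the paper: a tRNA gene is reduced to a functional class plus a few (position, base) features, a "function logo" is a lookup table of per-class feature weights, and the network is a single hidden layer with placeholder identity weights instead of trained parameters. All names (`score_against_logo`, `input_vector`, `mlp_forward`, `LOGOS`, `QUERY`) are hypothetical and do not reflect the CYANO-MLP implementation or its actual scoring function.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy types: a tRNA gene is (functional class, {position: base});
# a "function logo" maps (position, base) -> {functional class: weight}.
TRNA = Tuple[str, Dict[int, str]]
FunctionLogo = Dict[Tuple[int, str], Dict[str, float]]


def score_against_logo(trnas: List[TRNA], logo: FunctionLogo) -> float:
    """Sum the logo weights of the features each tRNA carries for its own class."""
    total = 0.0
    for func_class, bases in trnas:
        for position, base in bases.items():
            total += logo.get((position, base), {}).get(func_class, 0.0)
    return total


def input_vector(trnas: List[TRNA], logos: List[FunctionLogo]) -> List[float]:
    """One score per sub-clade logo, giving an eight-element vector here."""
    return [score_against_logo(trnas, logo) for logo in logos]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def mlp_forward(x, w1, b1, w2, b2) -> List[float]:
    """One ReLU hidden layer followed by a softmax over sub-clades."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    return softmax(logits)


# Toy data (assumption): logo k rewards an "A" at position k + 1 in tRNA-Ala.
LOGOS: List[FunctionLogo] = [{(pos, "A"): {"Ala": 1.0}} for pos in range(1, 9)]
QUERY: List[TRNA] = [("Ala", {3: "A", 5: "G"}), ("Ala", {3: "A"}), ("Gly", {3: "A"})]

# Placeholder parameters: identity weights stand in for trained ones.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
ZERO = [0.0] * 8

x = input_vector(QUERY, LOGOS)
probs = mlp_forward(x, IDENTITY, ZERO, IDENTITY, ZERO)
best = max(range(len(probs)), key=probs.__getitem__)
print("scores:", x)
print("predicted sub-clade index:", best)
```

With identity weights the network simply favors the sub-clade whose logo scores highest. A trained classifier would learn its weights from genomes of known clade membership.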
