Abstract

The pan-genome is defined as the combined set of all genes in the gene pool of a species. Pan-genome analyses have been very useful in helping to understand different evolutionary dynamics of bacterial species: an open pan-genome often indicates a free-living lifestyle with metabolic versatility, while closed pan-genomes are linked to host-restricted, ecologically specialized bacteria. A detailed understanding of the species pan-genome has also been instrumental in tracking the phylodynamics of emerging drug resistance mechanisms and drug-resistant pathogens. However, current approaches to analysing a species’ pan-genome do not take the species population structure into account, nor do they account for the uneven sampling of different lineages, which is commonplace due to over-sampling of clinically relevant representatives. Here we present the application of a population structure-aware approach for classifying genes in a pan-genome based on within-species distribution. We demonstrate our approach on a collection of 7500 genomes of Escherichia coli, one of the most-studied bacterial species and a model for an open pan-genome. We reveal clearly distinct groups of genes, clustered by different underlying evolutionary dynamics, and provide a more biologically informed and accurate description of the species’ pan-genome.

Highlights

  • Advances in whole genome sequencing in the last two decades and the ability to sequence multiple isolates of the same species have revealed that, often, only a small fraction of genes are shared by all species members

  • To demonstrate how one can refine a pan-genome description while accounting for population structure, we used a recently published genome collection that includes over 7500 E. coli and Shigella sp. genomes isolated from human hosts, referred to as the Horesh collection [11]

  • The genomes in the Horesh collection were collated from publications and other public resources, representing the known diversity of the clinical E. coli isolate genomes available in public databases, and underwent quality-control steps to ensure a final set of high-quality genomes


Summary

Introduction

Advances in whole genome sequencing in the last two decades and the ability to sequence multiple isolates of the same species have revealed that, often, only a small fraction of genes are shared by all species members. Measuring gene frequencies across the whole dataset does not account for the population structure or biased sampling of the genomes in the dataset. Such simple classification can be problematic when the population of interest consists of multiple deep-branching lineages that are unevenly represented in the collection. If 50% of a genome collection is represented by one lineage that was heavily over-sampled compared to other lineages, and all isolates of that lineage have a particular gene which is absent in all other lineages, this gene will be defined as an ‘intermediate’ gene. Based on these definitions alone, it would not be differentiated from a gene that is found in all isolates of
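The over-sampling problem described above can be illustrated with a small toy calculation. This is only a sketch under stated assumptions, not the authors' method: the frequency cut-offs (`CORE_MIN`, `RARE_MAX`), the function names, the lineage labels and the presence/absence vectors are all hypothetical and chosen for illustration. It shows that two genes with the same collection-wide frequency can have very different within-lineage distributions.

```python
from collections import defaultdict

# Hypothetical cut-offs for illustration only; the text above does not state them.
CORE_MIN = 0.95   # assumed: at or above this overall frequency a gene is "core"
RARE_MAX = 0.15   # assumed: below this overall frequency a gene is "rare"


def overall_frequency(presence):
    """Fraction of all isolates carrying the gene, ignoring lineage."""
    return sum(presence) / len(presence)


def classify_by_overall_frequency(freq):
    """Frequency-only classification that ignores population structure."""
    if freq >= CORE_MIN:
        return "core"
    if freq < RARE_MAX:
        return "rare"
    return "intermediate"


def per_lineage_frequency(presence, lineages):
    """Fraction of isolates carrying the gene within each lineage."""
    counts = defaultdict(lambda: [0, 0])  # lineage -> [carriers, isolates]
    for present, lineage in zip(presence, lineages):
        counts[lineage][0] += int(present)
        counts[lineage][1] += 1
    return {lin: carriers / total for lin, (carriers, total) in counts.items()}


# Toy collection: lineage A is over-sampled and makes up 50% of the isolates.
lineages = ["A"] * 12 + ["B"] * 6 + ["C"] * 6

# gene_x: present in every isolate of lineage A, absent from B and C.
gene_x = [1] * 12 + [0] * 12
# gene_y: present in half of the isolates of every lineage.
gene_y = [1] * 6 + [0] * 6 + [1] * 3 + [0] * 3 + [1] * 3 + [0] * 3

for name, gene in (("gene_x", gene_x), ("gene_y", gene_y)):
    freq = overall_frequency(gene)
    print(name, freq, classify_by_overall_frequency(freq),
          per_lineage_frequency(gene, lineages))
```

Both toy genes have an overall frequency of 0.5 and fall into the same frequency-only class, while the per-lineage frequencies separate a lineage-specific gene (1.0 in A, 0.0 elsewhere) from one that is spread evenly across lineages (0.5 in each).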

