Abstract

The standard workhorse for genomic analysis of the evolution of bacterial populations is phylogenetic modelling of mutations in the core genome. However, a notable amount of information about evolutionary and transmission processes in diverse populations can be lost unless the accessory genome is also taken into consideration. Here, we introduce panini (Pangenome Neighbour Identification for Bacterial Populations), a computationally scalable method for identifying the neighbours for each isolate in a data set using unsupervised machine learning with stochastic neighbour embedding based on the t-SNE (t-distributed stochastic neighbour embedding) algorithm. panini is browser-based and integrates with the Microreact platform for rapid online visualization and exploration of both core and accessory genome evolutionary signals, together with relevant epidemiological, geographical, temporal and other metadata. Several case studies with single- and multi-clone pneumococcal populations are presented to demonstrate the ability to identify biologically important signals from gene content data. panini is available at http://panini.pathogen.watch and code at http://gitlab.com/cgps/panini.
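To make the idea of gene-content neighbours concrete, here is a minimal sketch. It is not the panini implementation. panini is described above as using t-SNE, whereas this sketch swaps in a plain nearest-neighbour lookup on Hamming distances between accessory-gene presence/absence profiles. The isolate names, gene names and the helper functions `hamming` and `nearest_neighbours` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy data: isolate name -> set of accessory genes present.
# Real input would come from a pangenome presence/absence matrix.
ACCESSORY: Dict[str, Set[str]] = {
    "iso1": {"geneA", "geneB", "geneC"},
    "iso2": {"geneA", "geneB"},
    "iso3": {"geneA", "geneB", "geneC", "geneD"},
    "iso4": {"geneX", "geneY"},
    "iso5": {"geneX", "geneY", "geneZ"},
}


def hamming(a: Set[str], b: Set[str]) -> int:
    """Number of genes present in exactly one of the two isolates."""
    return len(a ^ b)


def nearest_neighbours(profiles: Dict[str, Set[str]], k: int = 1) -> Dict[str, List[str]]:
    """For each isolate, return the k isolates with the most similar gene content.

    Assumption: plain k-nearest-neighbours on Hamming distance, used here as a
    simple stand-in for the t-SNE embedding mentioned in the abstract.
    Ties are broken alphabetically by isolate name.
    """
    result: Dict[str, List[str]] = {}
    for name, genes in profiles.items():
        ranked = sorted(
            (hamming(genes, other_genes), other)
            for other, other_genes in profiles.items()
            if other != name
        )
        result[name] = [other for _, other in ranked[:k]]
    return result


NEIGHBOURS = nearest_neighbours(ACCESSORY, k=1)
```

In this toy data set, the isolates carrying geneX and geneY pair up with each other, while the remaining three isolates group around iso1. A pairwise distance matrix built the same way could in principle be passed to a t-SNE implementation to obtain a two-dimensional layout, but that step is outside this sketch.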

Highlights

  • In less than a decade, bacterial population genomics has progressed from the sequencing of dozens to thousands of strains [1,2,3,4]

  • To demonstrate utility within population genomics, we first explore how the method performs in a simulated setting, where the relationship between all sequences is known; we then extend our analysis to published bacterial population data sets

  • The rapid increase in sampling density of bacterial populations for epidemiological and evolutionary studies highlights the need to combine traditional genomic markers, such as single nucleotide polymorphism (SNP) loci and small insertions or deletions in coding regions, with measures of difference in gene content


Summary

INTRODUCTION

In less than a decade, bacterial population genomics has progressed from the sequencing of dozens to thousands of strains [1,2,3,4]. While phylogenetic trees are very useful, they are generally estimated using only core-genome variation (i.e. those regions of the genome common to all members of a sample), which may represent only a fraction of the relevant differences present in genomes across the study population. Several recent studies highlight the importance of considering variation in gene content when investigating the ecological and evolutionary processes leading to the observed data [6, 7]. We demonstrate the biological utility of such an approach by application to multiple population data sets.

