Defining the Estimated Core Genome of Bacterial Populations Using a Bayesian Decision Model

Andries J Van Tonder, Angela B Brueggemann, Chris L Farmer, Martin C J Maiden, Dorothea M C Hill, Anne Von Gottberg, Julian Parkhill, Stephen D Bentley, James E Bray, Alison J Cody, Keith A Jolley, Shilan Mistry, Keith P Klugman, Christos A Ouzounis

doi:10.1371/journal.pcbi.1003788

Abstract

The bacterial core genome is of intense interest and the volume of whole genome sequence data in the public domain available to investigate it has increased dramatically. The aim of our study was to develop a model to estimate the bacterial core genome from next-generation whole genome sequencing data and use this model to identify novel genes associated with important biological functions. Five bacterial datasets were analysed, comprising 2096 genomes in total. We developed a Bayesian decision model to estimate the number of core genes, calculated pairwise evolutionary distances (p-distances) based on nucleotide sequence diversity, and plotted the median p-distance for each core gene relative to its genome location. We designed visually-informative genome diagrams to depict areas of interest in genomes. Case studies demonstrated how the model could identify areas for further study, e.g. 25% of the core genes with higher sequence diversity in the Campylobacter jejuni and Neisseria meningitidis genomes encoded hypothetical proteins. The core gene with the highest p-distance value in C. jejuni was annotated in the reference genome as a putative hydrolase, but further work revealed that it shared sequence homology with beta-lactamase/metallo-beta-lactamases (enzymes that provide resistance to a range of broad-spectrum antibiotics) and thioredoxin reductase genes (which reduce oxidative stress and are essential for DNA replication) in other C. jejuni genomes. Our Bayesian model of estimating the core genome is principled, easy to use and can be applied to large genome datasets. This study also highlighted the lack of knowledge currently available for many core genes in bacterial genomes of significant global public health importance.
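The p-distance step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of a p-distance (the proportion of compared sites at which two aligned sequences differ) and that sites with gaps or ambiguous bases are skipped for that pair; the abstract does not state how the study handled such sites. The function names and the toy alignment are hypothetical and are not taken from the authors' pipeline.

```python
from itertools import combinations
from statistics import median


def p_distance(seq_a, seq_b):
    """Proportion of comparable aligned sites at which two sequences differ.

    Assumption: sites where either sequence has a gap or an ambiguous base
    are skipped (pairwise deletion); the source does not specify this.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = "ACGT"
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differing / compared


def median_p_distance(aligned_sequences):
    """Median p-distance over all pairs of aligned sequences of one gene."""
    distances = [p_distance(a, b) for a, b in combinations(aligned_sequences, 2)]
    return median(distances)


# Hypothetical toy alignment of one core gene from three isolates.
toy_gene = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
print(median_p_distance(toy_gene))  # prints 0.125
```

Computing this median for every core gene and plotting it against the gene's position in a reference genome would give the kind of per-gene diversity profile the abstract describes.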

Highlights

The advent of next-generation sequencing (NGS) has greatly increased the number of bacterial genomes sequenced and made available for study in public databases such as GenBank, the Sequence Read Archive and the European Nucleotide Archive (ENA) [1,2,3].
We developed a simple statistical model to estimate the number of core genes in a bacterial genome dataset, calculated pairwise evolutionary distances (p-distances) based on differences among nucleotide sequences, and plotted the median p-distance for each core gene relative to its genome location.
Description of the datasets used in analyses: in total, 2096 genomes were analysed across the five bacterial species (Table 1 and Datasets S1).

Summary

Introduction

The advent of next-generation sequencing (NGS) has greatly increased the number of bacterial genomes sequenced and made available for study in public databases such as GenBank, the Sequence Read Archive and the European Nucleotide Archive (ENA) [1,2,3]. Any collection of isolates is a subset of the entire population of the species of interest. If that subset has limited genetic diversity, the number of "core" genes shared by all isolates in the sample will be higher than in a genetically more diverse dataset. This is not necessarily a problem, unless the intention is to extrapolate the findings to the wider bacterial population.



Journal: PLoS Computational Biology
Publication Date: Aug 21, 2014
Citations: 86
License type: CC BY 4.0


