Abstract

Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analyzed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). However, it has proven difficult to construct networks in which edges solely represent LGT. Here we apply term frequency-inverse document frequency (TF-IDF), an alignment-free method originating from document analysis, to infer regions of lateral origin in bacterial genomes. We examine four empirical datasets of different sizes (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes and the edges natively represent LGT. We then extract maximum and maximal cliques (i.e., GECs) from these graphs, and identify nodes that belong to GECs across a wide range of k. Most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests, we demonstrate that biological processes associated with metabolism, regulation and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to changes in k.
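The two computational ideas named in the abstract, TF-IDF weighting of words of length k and clique extraction from a network, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it uses the textbook TF-IDF formula (relative word frequency times log inverse document frequency) with each group of sequences treated as one document, and a basic Bron-Kerbosch enumeration of maximal cliques. The function names, the toy sequences, and the node labels g1 to g4 are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_tfidf(groups, k):
    """Textbook TF-IDF per k-mer; each group of sequences is one 'document'.

    groups: dict mapping a group label to a list of sequences (toy data).
    Returns a dict: label -> {kmer: tf * idf}.
    """
    counts = {g: Counter(w for s in seqs for w in kmers(s, k))
              for g, seqs in groups.items()}
    n_docs = len(counts)
    # Document frequency: in how many groups does each word occur?
    doc_freq = Counter(w for c in counts.values() for w in c)
    scores = {}
    for g, c in counts.items():
        total = sum(c.values())
        scores[g] = {w: (n / total) * math.log(n_docs / doc_freq[w])
                     for w, n in c.items()}
    return scores


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting).

    adj: dict mapping each node to the set of its neighbours (undirected).
    """
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already-tried nodes
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


# Hypothetical toy input: three groups of sequences, word length k = 4.
groups = {"g1": ["ACGTACGT"], "g2": ["ACGTTTTT"], "g3": ["GGGGCCCC"]}
scores = kmer_tfidf(groups, k=4)

# Hypothetical toy network: nodes are groups, edges stand for inferred exchange.
adj = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g3"},
}
cliques = maximal_cliques(adj)
maximum_clique = max(cliques, key=len)
```

In this toy setting a word confined to one group (such as CGTA in g1) scores higher than a word shared with another group (ACGT), which is the intuition behind using TF-IDF to flag group-specific signal. The maximum clique is simply the largest of the maximal cliques; repeating the extraction for several values of k would show which nodes remain in a clique throughout.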

Highlights

  • Bacteria and archaea (BA) comprise much of the planet’s biodiversity

  • We analyze four datasets, three of which we introduced earlier (Cong et al, 2016b): 20 Escherichia coli and seven Shigella genomes (ECS dataset), 110 enteric bacterial genomes (EB) and 143 genomes from BA. To these we add a dataset of 144 bacterial genomes (BAC) purpose-built for this analysis

  • We recognize groups within the ECS dataset according to multi-locus sequence type (MLST; Gordon et al, 2008); within EB by genus, sometimes combining Escherichia and Shigella genomes into a single group; within BA by phylum, or alternatively by class; and within BAC by order

Introduction

Bacteria and archaea (BA) comprise much of the planet’s biodiversity. Communities of these organisms are responsible for key biological and geochemical processes including nitrogen fixation, aerobic and anaerobic digestion of biomass, and oxidative dissolution of minerals. Bacteria cause a range of diseases in plants, animals, and humans. Since 1996, genome-sequencing technologies have been applied, initially to study bacterial pathogenesis and more recently to understand environmental processes and explore biodiversity. Genome sequences are publicly available for more than 30,000 BA, and large international projects are underway to sequence many thousands more.
