Abstract

Microbial community metagenomes and individual microbial genomes are becoming increasingly accessible by means of high-throughput sequencing. Assessing organismal membership within a community is typically performed using one or a few taxonomic marker genes such as the 16S rDNA, and these same genes are also employed to reconstruct molecular phylogenies. There is thus a growing need to bioinformatically catalog strongly conserved core genes that can serve as effective taxonomic markers, to assess the agreement among phylogenies generated from different core genes, and to characterize the biological functions enriched within core genes and thus conserved throughout large microbial clades. We present a method to recursively identify core genes (i.e. genes ubiquitous within a microbial clade) in high-throughput from a large number of complete input genomes. We analyzed over 1,100 genomes to produce core gene sets spanning 2,861 bacterial and archaeal clades, ranging in size from one to >2,000 genes in inverse correlation with the α-diversity (total phylogenetic branch length) spanned by each clade. These cores are enriched as expected for housekeeping functions including translation, transcription, and replication, in addition to significant representations of regulatory, chaperone, and conserved uncharacterized proteins. In agreement with previous manually curated core gene sets, phylogenies constructed from one or more of these core genes agree with those built using 16S rDNA sequence similarity, suggesting that systematic core gene selection can be used to optimize both comparative genomics and determination of microbial community structure. Finally, we examine functional phylogenies constructed by clustering genomes by the presence or absence of orthologous gene families and show that they provide an informative complement to standard sequence-based molecular phylogenies.

Highlights

  • The number of fully sequenced microbial genomes recently passed one thousand, and the number of metagenomically sequenced microbial communities is growing rapidly [1,2]

  • We functionally characterized the core genes and compared them to a functional phylogeny constructed by joining organisms with similar pathway and orthologous gene family complements

  • We applied our method to all microbial genomes currently available from the NCBI, determining core genes conserved at each clade within the NCBI Taxonomy (Figure 1)

Summary

Introduction

The number of fully sequenced microbial genomes recently passed one thousand, and the number of metagenomically sequenced microbial communities is growing rapidly [1,2]. Gene families strongly conserved within groups of related microbial organisms serve several important purposes in biologically interpreting these data. Their sequence variation can be used to reconstruct molecular phylogenies describing the evolutionary relationship among microorganisms [3,4]. The variable regions of gene sequences shared by broad groups of bacteria or archaea can be used as taxonomic markers to determine their presence and abundance within microbial communities [6]. Each of these applications has been highly successful when employing manually curated core gene sets [3,7], but as detailed below, current computational techniques rarely scale to thousands of complete genomes. Methods for rapidly cataloging gene families conserved within microbial clades are needed in order to take advantage of this growing number of sequenced organisms and communities.
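
To make the idea of recursively cataloging clade-level core genes concrete, the following is a minimal Python sketch rather than the method used in this work. It assumes that each genome has already been reduced to a set of orthologous gene family labels, that the taxonomy is available as a tree, and that "core" means strictly present in every genome of a clade; a practical method would likely need to tolerate missing or misannotated genes. The `Clade` class, the `collect_cores` function, and the toy gene sets are all hypothetical and purely illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Clade:
    """A node in a taxonomy tree (hypothetical structure for illustration)."""
    name: str
    children: list[Clade] = field(default_factory=list)
    # Gene family labels; only meaningful for leaves (individual genomes).
    genes: frozenset[str] = frozenset()


def collect_cores(clade: Clade, cores: dict[str, frozenset[str]]) -> frozenset[str]:
    """Fill `cores` with the core gene set of `clade` and every clade below it.

    Assumption: a gene family is core only if it occurs in every genome of
    the clade, so a parent's core is the intersection of its children's cores.
    """
    if not clade.children:
        core = frozenset(clade.genes)
    else:
        child_cores = [collect_cores(child, cores) for child in clade.children]
        core = frozenset.intersection(*child_cores)
    cores[clade.name] = core
    return core


# Toy taxonomy with made-up gene content, not real annotation data.
ecoli = Clade("E. coli", genes=frozenset({"rpoB", "recA", "lacZ"}))
salmonella = Clade("S. enterica", genes=frozenset({"rpoB", "recA", "invA"}))
bacillus = Clade("B. subtilis", genes=frozenset({"rpoB", "spo0A"}))
entero = Clade("Enterobacteriaceae", children=[ecoli, salmonella])
root = Clade("Bacteria", children=[entero, bacillus])

cores: dict[str, frozenset[str]] = {}
collect_cores(root, cores)

for name, core in cores.items():
    print(name, sorted(core))
```

Because each clade's core is computed once from its children, the work grows with the size of the tree rather than with the number of genome pairs. In this toy example the core also shrinks as the clade broadens, which is consistent with the inverse relationship between core size and clade diversity described in the abstract.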
