Abstract

The study of microbial pangenomes relies on the computation of gene families, i.e., the clustering of coding sequences into groups of essentially similar genes. There is no standard approach for obtaining such gene families. Ideally, gene family computations should be robust against errors in the annotation of genes across genomes. To achieve this robustness, we propose clustering sequences by their domain sequence, i.e., the ordered sequence of domains in their protein sequence. In a study of 347 Escherichia coli genomes, we find on average around 4500 proteins with hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families per genome. Across all genomes we find a total of 5724 such families. A binomial mixture model indicates that this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps law analysis indicates that the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained with domain sequence families. Clustering sequences by their ordered domain content gives domain sequence families that are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is necessary given the expected rapid growth in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains has potential that remains to be explored.
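The grouping step described above can be illustrated with a minimal sketch. It assumes the domain hits for each protein are already available as (start position, domain label) pairs, for instance parsed from a Pfam-A search; the function names, the input layout, and the example labels are hypothetical and are not taken from the study's own pipeline.

```python
from collections import defaultdict


def domain_sequence(hits):
    """Return the ordered tuple of domain labels for one protein.

    `hits` is a list of (start_position, domain_label) pairs; the
    domains are ordered by their start position along the protein.
    """
    return tuple(label for _, label in sorted(hits))


def cluster_by_domain_sequence(protein_hits):
    """Group protein identifiers that share the same domain sequence.

    Proteins without any domain hit are skipped, since they have no
    domain sequence to cluster on.
    """
    families = defaultdict(list)
    for protein_id, hits in protein_hits.items():
        if not hits:
            continue
        families[domain_sequence(hits)].append(protein_id)
    return dict(families)


# Illustrative input only: identifiers and domain labels are made up
# for the example and do not come from the 347 genomes in the study.
example = {
    "g1_p1": [(120, "PF00005"), (10, "PF00664")],
    "g1_p2": [(5, "PF00072")],
    "g2_p1": [(8, "PF00664"), (130, "PF00005")],
    "g2_p2": [],
}
families = cluster_by_domain_sequence(example)
for sequence, members in families.items():
    print(sequence, members)
```

Because each protein is assigned to a family by a single dictionary lookup, the work in this sketch grows in proportion to the number of proteins, which is in line with the linear scaling in the number of genomes mentioned in the abstract.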

Highlights

  • Microbial pangenomics has attracted interest over recent years, stimulated by the availability of sequence data from whole-genome re-sequencing projects[1,2,3,4,5,6,7]

  • There are more than a thousand E. coli genomic projects listed[10] and this number will grow in the near future, along with genomes for many other bacteria; it is reasonable to assume that pangenomics will attract more attention

  • The data are from the model organism Escherichia coli only, and many of the more specific results cannot be extended to other species



Introduction

Microbial pangenomics has attracted interest over recent years, stimulated by the availability of sequence data from whole-genome re-sequencing projects[1,2,3,4,5,6,7]. The pangenome diversity can be huge, which is reflected in the span of phenotypes. An example of this is found in the model organism Escherichia coli[8,9]. The sample pangenome size is the total number of gene families found in the currently available genomes, while the population pangenome size is the number of gene families we would expect to see if every single strain were sequenced. It is the latter that is of scientific interest, but it must be estimated from the former. A fraction of the gene families will be found in only a small number of genomes, and those observed in only a single genome are called ORFans.
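The two quantities defined above that can be read directly from the data, the sample pangenome size and the set of ORFans, can be sketched as follows. This is an illustration under stated assumptions: each genome is represented as a set of gene family labels, and the function name, strain names, and family labels are hypothetical. Estimating the population pangenome size requires a statistical model and is not attempted here.

```python
from collections import Counter


def sample_pangenome(genome_families):
    """Return (sample pangenome size, sorted list of ORFan families).

    `genome_families` maps a genome name to the set of gene family
    labels found in that genome. A family counts once per genome,
    however many member genes the genome carries.
    """
    genomes_per_family = Counter()
    for families in genome_families.values():
        genomes_per_family.update(set(families))
    # ORFans are the families observed in exactly one genome.
    orfans = sorted(f for f, n in genomes_per_family.items() if n == 1)
    return len(genomes_per_family), orfans


# Illustrative data only, not taken from the study.
genomes = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam3", "fam4"},
    "strain_C": {"fam1", "fam5"},
}
size, orfans = sample_pangenome(genomes)
print(size, orfans)
```

In this toy input, five distinct families are seen across the three genomes, and three of them occur in a single genome each.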
