Automatic identification of optimal marker genes for phenotypic and taxonomic groups of microorganisms.

Elad Segev, Tom Ben Sasson, Edouard Jurkevitch, Mira Gonen, Zohar Pasternak, Christos A Ouzounis

doi:10.1371/journal.pone.0195537

Abstract

Finding optimal markers for microorganisms important in the medical, agricultural, environmental or ecological fields is of great importance. The thousands of complete microbial genomes now available allow us, for the first time, to exhaustively identify marker proteins for groups of microbial organisms. In this work, we model the biological task as the well-known mathematical “hitting set” problem and solve it with both greedy and randomized approximation algorithms. We identify unique markers for 17 phenotypic and taxonomic microbial groups, including proteins related to the nitrite reductase enzyme as markers for the non-anammox nitrifying bacteria group, and two transcription regulation proteins, nusG and yhiF, as markers for the Archaea and Escherichia/Shigella taxonomic groups, respectively. Additionally, we identify marker proteins for three subtypes of pathogenic E. coli, which previously had no known optimal markers. In practice, depending on the completeness of the database, this algorithm can be used to identify marker genes for any microbial group. These marker genes may be prime candidates for understanding the genetic basis of the group's phenotype or for discovering novel functions that are uniquely shared among a group of microbes. We show that our method is both theoretically and practically efficient, and we establish an upper bound on its time complexity and approximation ratio; thus, it promises to remain efficient and to permit the identification of marker proteins specific to phenotypic or taxonomic groups, even as more and more bacterial genomes are sequenced.

Highlights

The first complete bacterial genome sequence was published in 1995 [1].
The solution is usually to increase the number of genes, so in core genome MLST (cgMLST), between 1500 and 3000 marker genes are used, which increases the discriminative power but forces any new isolate to be fully sequenced before it can be typed, requiring complex genomic analysis.
In order to fully challenge our methods and algorithms, we applied the algorithm to 17 different microbial groups which represent a wide variety of classification criteria: non-anammox nitrifying bacteria and predatory bacteria as phenotypic groups, Archaea and Escherichia/Shigella as taxonomic groups, and 13 different subtypes of pathogenic E. coli as taxonomic/phenotypic groups.

Summary

Introduction

The first complete bacterial genome sequence was published in 1995 [1]. Since then, sequencing technology has developed rapidly, dramatically reducing the cost of sequencing and making bacterial genome sequencing affordable to a great number of labs [2].

A different way to improve MLST is to discard housekeeping genes in favor of small groups of genes that are unique to specific taxonomic or phenotypic groups. This allows quick and affordable typing, using PCR instead of whole-genome sequencing, while retaining high discriminative power.

In S1 Algorithm, we show that even if we are willing to relax our problem to that of finding a hitting set of a limited size, an exact approach is impractical. Since this problem is of great importance, considerable effort has been made to find efficient approximation algorithms for it. We apply an approximation algorithm which finds relatively small sets of proteins that identify the group of interest. Both algorithms were used to identify non-anammox nitrifying bacteria and predatory bacteria as phenotypic groups, Archaea and Escherichia/Shigella as taxonomic groups, and 13 different pathogenic sub-groups of E. coli as combined phenotypic/taxonomic groups.
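To make the hitting set idea concrete, the following is a minimal sketch of the textbook greedy approximation, not necessarily the exact procedure used in this work. It assumes one plausible encoding of the marker problem: for every genome outside the group of interest, we list the candidate proteins that this genome lacks, and we look for a small set of proteins that "hits" every such list. The function name `greedy_hitting_set`, the protein labels and the genome labels are all hypothetical.

```python
def greedy_hitting_set(sets):
    """Return a list of elements that intersects every set in `sets`.

    Textbook greedy heuristic: repeatedly pick the element that occurs
    in the largest number of sets that have not been hit yet.
    """
    remaining = [set(s) for s in sets]
    if any(not s for s in remaining):
        raise ValueError("an empty set can never be hit")
    chosen = []
    while remaining:
        # Count how many un-hit sets each element would hit.
        counts = {}
        for s in remaining:
            for element in s:
                counts[element] = counts.get(element, 0) + 1
        # Highest count wins; ties are broken by name for determinism.
        best = min(counts, key=lambda e: (-counts[e], e))
        chosen.append(best)
        # Drop every set that the chosen element hits.
        remaining = [s for s in remaining if best not in s]
    return chosen


# Hypothetical toy data (an assumption, not taken from the paper):
# for each out-of-group genome, the candidate proteins it does NOT contain.
absent_candidates = {
    "genome_A": {"p1", "p2"},
    "genome_B": {"p2", "p3"},
    "genome_C": {"p3"},
    "genome_D": {"p2", "p4"},
}
markers = greedy_hitting_set(absent_candidates.values())
print(markers)  # ['p2', 'p3']
```

In this toy example, every out-of-group genome lacks at least one of `p2` or `p3`, so under the assumed encoding the pair would distinguish the group from all four genomes. The greedy choice does not guarantee a minimum-size set; it is known to give a logarithmic approximation ratio for the equivalent set cover problem.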

Materials and methods

Solve the following fractional linear programming problem
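As a hedged illustration of how a fractional solution might be turned into a marker set, the sketch below assumes the standard covering relaxation of hitting set, in which each candidate protein receives a value between 0 and 1 and the values within every set sum to at least 1. This may differ from the formulation used in this work. The function `round_fractional_solution`, its repair step and the toy values are assumptions made for illustration; the fractional values are taken as given rather than computed by a solver.

```python
import math
import random


def round_fractional_solution(sets, x, seed=0):
    """Randomized rounding of a fractional hitting set solution.

    `sets` is a collection of non-empty sets that must each be hit.
    `x` maps each element to a fractional value in [0, 1], assumed to
    come from a linear programming relaxation solved elsewhere.
    """
    rng = random.Random(seed)
    sets = [set(s) for s in sets]
    if any(not s for s in sets):
        raise ValueError("an empty set can never be hit")
    # Roughly ln(m) independent rounding rounds for m sets.
    rounds = max(1, math.ceil(math.log(max(len(sets), 2))))
    chosen = set()
    for _ in range(rounds):
        for element in sorted(x):
            # Keep each element with probability equal to its value.
            if rng.random() < x[element]:
                chosen.add(element)
    # Repair step (an added safeguard): any set still un-hit receives
    # its element with the largest fractional value.
    for s in sets:
        if not (s & chosen):
            chosen.add(max(sorted(s), key=lambda e: x.get(e, 0.0)))
    return chosen


# Hypothetical toy instance: three sets over three proteins, with the
# fractional solution 0.5 for each protein (every set sums to 1).
toy_sets = [{"p1", "p2"}, {"p2", "p3"}, {"p1", "p3"}]
toy_x = {"p1": 0.5, "p2": 0.5, "p3": 0.5}
print(sorted(round_fractional_solution(toy_sets, toy_x)))
```

The repair step trades a slightly larger output for a guarantee that the returned set always hits every input set, regardless of the random draws.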

Results and discussion