Abstract
Background: Comparative genomics can leverage the vast amount of available genomic sequences to reconstruct and analyze transcriptional regulatory networks in Bacteria, but the efficacy of this approach hinges on the ability to transfer regulatory network information from reference species to the genomes under analysis. Several methods have been proposed to transfer regulatory information between bacterial species, but the paucity and distributed nature of experimental information on bacterial transcriptional networks have prevented their systematic evaluation.
Results: We report the compilation of a large catalog of transcription factor-binding sites across Bacteria and its use to systematically benchmark proposed transfer methods across pairs of bacterial species. We evaluate motif- and accuracy-based metrics to assess the results of regulatory network transfer, and we identify the precision-recall area-under-the-curve as the best metric for this purpose because the problem is highly class-imbalanced. Methods assuming conservation of the transcription factor-binding motif (motif-based) are shown to substantially outperform those assuming conservation of regulon composition (network-based), even though their efficiency can decrease sharply with increasing phylogenetic distance. Variations of the basic motif-based transfer method do not yield significant improvements in transfer accuracy. Our results indicate that detection of a large enough number of regulated orthologs is critical for network-based transfer methods, but that relaxing orthology requirements does not improve results. Using the transcriptional regulators LexA and Fur as case examples, we also show how DNA-binding domain sequence similarity can yield confounding results as an indicator of transfer efficiency for motif-based methods.
Conclusions: Counter to standard practice, our evaluation of metrics to assess the efficiency of methods for regulatory network information transfer reveals that the area under precision-recall (PR) curves is a more precise and informative metric than the area under receiver-operating-characteristic (ROC) curves, confirming similar findings in other class-imbalanced settings. Our systematic assessment of transfer methods reveals that simple approaches to both motif- and network-based transfer of regulatory information provide equal or better results than more elaborate methods. We also show that there are no effective predictors of transfer efficacy, substantiating the long-standing practice of manual curation in comparative genomics analyses.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1113-7) contains supplementary material, which is available to authorized users.
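The contrast between ROC and PR AUC under class imbalance can be illustrated with a small sketch. This is not the authors' evaluation pipeline: the labels and scores are made-up toy values (2 true sites among 1000 candidate positions), the function names are hypothetical, and PR AUC is approximated here by average precision, assuming no score ties between positives and negatives.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count positive-vs-negative wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision measured at each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (assumption): 2 true sites, 998 background positions.
# Ten background positions outscore the second true site.
labels = [1, 1] + [0] * 998
scores = [0.99, 0.90] + [0.95] * 10 + [0.10] * 988

roc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC AUC = {roc:.3f}, PR AUC (average precision) = {ap:.3f}")
```

On this toy input the ROC AUC stays close to 1 (about 0.995) while the average precision drops to about 0.583: ten false positives are negligible against 998 negatives in the ROC computation, but they dominate the precision at the second true site.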
Highlights
Comparative genomics can leverage the vast amount of available genomic sequences to reconstruct and analyze transcriptional regulatory networks in Bacteria, but the efficacy of this approach hinges on the ability to transfer regulatory network information from reference species to the genomes under analysis.
Data compilation and evaluation of metrics for the assessment of transfer methods: To perform a systematic analysis of methods for the transfer of transcriptional regulatory networks in Bacteria, we compiled data from five major databases reporting experimentally validated transcription factor (TF)-binding sites across the Bacteria domain.
We focused on the Euclidean distance and the Kullback–Leibler (KL) divergence as well-established motif comparison functions based on the position-specific frequency matrix (PSFM) defined by the motif [23], and on two standard metrics for classification accuracy based on the area-under-the-curve (AUC) derived from a TF-binding site search process: the receiver-operating-characteristic (ROC) AUC and the precision-recall (PR) AUC [24, 25].
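The two motif comparison functions can be sketched as follows. This is a minimal illustration, not the exact formulation used in the study: it assumes two pre-aligned PSFMs of equal length, averages the per-column values, and adds a small pseudocount (an assumption) so that the KL divergence stays finite when a frequency is zero. The function names and toy matrices are hypothetical.

```python
import math


def euclidean_distance(psfm_a, psfm_b):
    """Mean per-column Euclidean distance between two aligned PSFMs."""
    per_column = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))
        for col_a, col_b in zip(psfm_a, psfm_b)
    ]
    return sum(per_column) / len(per_column)


def kl_divergence(psfm_p, psfm_q, pseudocount=1e-6):
    """Mean per-column KL divergence D(P || Q), in bits, with smoothing."""
    def smooth(col):
        norm = 1.0 + len(col) * pseudocount
        return [(f + pseudocount) / norm for f in col]

    per_column = []
    for col_p, col_q in zip(psfm_p, psfm_q):
        p, q = smooth(col_p), smooth(col_q)
        per_column.append(sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q)))
    return sum(per_column) / len(per_column)


# Toy PSFMs (assumption): each column lists frequencies for A, C, G, T.
motif_a = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
motif_b = [[0.4, 0.2, 0.2, 0.2], [0.1, 0.1, 0.4, 0.4]]

print(euclidean_distance(motif_a, motif_b))
print(kl_divergence(motif_a, motif_b), kl_divergence(motif_b, motif_a))
```

The Euclidean distance is symmetric, whereas the KL divergence is not, so the order of the reference and target motifs matters when the latter is used.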
Summary
Comparative genomics can leverage the vast amount of available genomic sequences to reconstruct and analyze transcriptional regulatory networks in Bacteria, but the efficacy of this approach hinges on the ability to transfer regulatory network information from reference species to the genomes under analysis. Comparative genomics approaches have been routinely employed to study bacterial transcriptional regulatory networks, or regulons, controlled by a single transcription factor (TF). These studies have enabled the identification of core network elements and niche-specific adaptations, providing insights into the evolution of these systems [2,3,4,5,6,7]. The first step in such analyses is the transfer of available information on the regulatory network (i.e., known TF-binding sites and/or regulated genes) to the species under analysis, in order to infer the TF-binding motif in these target species. Search results from multiple genomes are then integrated across orthologs, based on the assumption that only orthologs of regulated genes will systematically display TF-binding sites in their promoter regions.
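The core of a motif-based transfer can be sketched as: build a PSFM from the reference TF-binding sites, convert it to log-odds scores, and scan a promoter in the target species for the best-scoring site. The sketch below makes several assumptions that are not taken from the article: a uniform background, a pseudocount of 1, single-strand scanning, and made-up toy sequences. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_psfm(sites, pseudocount=1.0):
    """Column-wise base frequencies from equal-length reference sites."""
    psfm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        psfm.append({base: counts[base] / total for base in BASES})
    return psfm


def score_site(psfm, seq, background=0.25):
    """Log-odds score (bits) of a candidate site against a uniform background."""
    return sum(math.log2(col[base] / background) for col, base in zip(psfm, seq))


def best_site(psfm, promoter):
    """Return (score, position) of the highest-scoring window in the promoter."""
    width = len(psfm)
    return max(
        (score_site(psfm, promoter[i:i + width]), i)
        for i in range(len(promoter) - width + 1)
    )


# Toy inputs (assumption): reference sites and a target promoter.
reference_sites = ["CTGTAT", "CTGTAC", "CTGTTT", "CTGAAT"]
target_promoter = "GGACCTGTATGGCA"

psfm = build_psfm(reference_sites)
best_score, best_pos = best_site(psfm, target_promoter)
print(f"best site at position {best_pos}: "
      f"{target_promoter[best_pos:best_pos + len(psfm)]} ({best_score:.2f} bits)")
```

In a full analysis, such scores would be computed for promoters across several target genomes and combined over orthologous genes; that integration step is omitted here.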