CMonkey2: Automated, systematic, integrated detection of co-regulated gene modules for any organism.

David J Reiss, Christopher L Plaisier, Wei-Ju Wu, Nitin S Baliga

doi:10.1093/nar/gkv300

Open Access

Abstract

The cMonkey integrated biclustering algorithm identifies conditionally co-regulated modules of genes (biclusters). cMonkey integrates various orthogonal pieces of information which support evidence of gene co-regulation, and optimizes biclusters to be supported simultaneously by one or more of these prior constraints. The algorithm served as the cornerstone for constructing the first global, predictive Environmental Gene Regulatory Influence Network (EGRIN) model for a free-living cell, and has now been applied to many more organisms. However, due to its computational inefficiencies, long run-time and complexity of various input data types, cMonkey was not readily usable by the wider community. To address these primary concerns, we have significantly updated the cMonkey algorithm and refactored its implementation, improving its usability and extendibility. These improvements provide a fully functioning and user-friendly platform for building co-regulated gene modules and the tools necessary for their exploration and interpretation. We show, via three separate analyses of data for E. coli, M. tuberculosis and H. sapiens, that the updated algorithm and inclusion of novel scoring functions for new data types (e.g. ChIP-seq and transcription factor over-expression [TFOE]) improve discovery of biologically informative co-regulated modules. The complete cMonkey2 software package, including source code, is available at https://github.com/baliga-lab/cmonkey2.

Highlights

It is widely acknowledged that gene regulatory networks (GRNs) are inherently modular in nature and organized hierarchically [1,2,3].
To assess the ramifications of the algorithm changes that we made in cMonkey2, we evaluated its performance relative to cMonkey1, to other popular clustering methods (k-means [41] and WGCNA [42]), and to the published data integration/module detection algorithms COALESCE [11], DISTILLER [10] and LeMoNe [4].
To further test the capability of the cMonkey2 set-enrichment scoring function to improve detection of experimentally validated regulons, we investigated its influence on modules detected for Mycobacterium tuberculosis, using a large gene expression compendium and new global ChIP-seq and transcription factor overexpression (TFOE) measurements.

Summary

Introduction

It is widely acknowledged that gene regulatory networks (GRNs) are inherently modular in nature and organized hierarchically [1,2,3]. This modular structure results from the regulation of genes by distinct combinations of regulatory factors; transcripts regulated by the same (set of) factor(s) are presumed to exhibit similar patterns of differential expression over different cellular and environmental conditions. Identifying co-regulated gene modules can significantly reduce the complexity of inferring genome-wide GRNs from data, and the modules can be exploited to greatly improve the accuracy of the inferred regulatory network topology [4,5,6,7]. For this reason, the detection of co-regulated gene modules via integrated modeling of multiple supporting data types has been an active research topic for more than a decade. Many of the resulting methods are implemented as complex command-line tools or graphical user interfaces that can be applied only to a limited, predefined set of model organisms.
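The idea that co-regulated transcripts show similar expression patterns over a subset of conditions can be illustrated with a toy coherence score. The sketch below is not the cMonkey2 scoring function; it uses the mean squared residue of Cheng and Church, a classic bicluster coherence measure, swapped in purely for illustration. The expression matrix, the function name and the variable names are all hypothetical.

```python
from statistics import mean


def mean_squared_residue(matrix, rows, cols):
    """Cheng-and-Church mean squared residue of a submatrix.

    Illustrative only: this is NOT how cMonkey2 scores biclusters.
    A value of 0 means every selected gene follows the same additive
    pattern across the selected conditions.
    """
    # Extract the candidate bicluster (selected genes x selected conditions).
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_means = [mean(r) for r in sub]
    col_means = [mean(c) for c in zip(*sub)]
    overall = mean(row_means)  # equals the grand mean: all rows have equal length
    residues = [
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(len(rows))
        for j in range(len(cols))
    ]
    return mean(residues)


# Hypothetical expression matrix: 4 genes (rows) x 5 conditions (columns).
# Genes 0-2 share one pattern, up to an additive shift, in conditions 0-2 only.
expression = [
    [1.0, 2.0, 3.0, 0.5, 4.0],
    [2.0, 3.0, 4.0, 5.0, 0.0],
    [3.0, 4.0, 5.0, 1.0, 1.0],
    [0.2, 2.9, 0.4, 3.3, 1.7],
]

module_score = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
full_score = mean_squared_residue(expression, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3, 4])
print(module_score, full_score)
```

The candidate module (genes 0 to 2 over conditions 0 to 2) scores lower than the full matrix, which is the sense in which a bicluster is "conditionally" coherent: the shared pattern holds only over a subset of conditions.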

Methods

Results

Conclusion