Abstract
Background: Identifying the key transcription factors (TFs) controlling a biological process is the first step toward a better understanding of the underlying regulatory mechanisms. However, because gene regulatory networks involve a large number of genes and complex interactions, identifying the TFs involved in a biological process remains particularly difficult. The challenges include: (1) Most eukaryotic genomes encode thousands of TFs, which are organized in gene families of various sizes and in many cases show poor sequence conservation, making it difficult to recognize the TFs for a biological process; (2) Transcription usually involves several hundred genes that generate a combination of intrinsic noise from upstream signaling networks and lead to fluctuations in transcription; (3) A TF can function in different cell types or developmental stages. Methods for identifying TFs involved in biological processes are still very scarce, and novel, more powerful methods are urgently needed.

Results: We developed a computational pipeline called TF-Cluster for identifying functionally coordinated TFs in two steps: (1) Construction of a shared coexpression connectivity matrix (SCCM), in which each entry represents the number of shared coexpressed genes between two TFs. This sparse and symmetric matrix embodies a new concept of coexpression networks in which genes are associated in the context of other shared coexpressed genes; (2) Decomposition of the SCCM using a novel heuristic algorithm termed "Triple-Link", which searches for the highest connectivity in the SCCM and then uses the two connected TFs as a primer for growing a TF cluster according to a number of linking criteria. We applied TF-Cluster to microarray data from human stem cells and Arabidopsis roots, and then demonstrated that many of the resulting TF clusters contain functionally coordinated TFs that, based on existing literature, accurately represent a biological process of interest.

Conclusions: TF-Cluster can be used to identify a set of TFs controlling a biological process of interest from gene expression data. Its high accuracy in recognizing true positive TFs involved in a biological process makes it extremely valuable for building core GRNs controlling a biological process. The pipeline is implemented in Perl and can be installed on various platforms.
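The SCCM construction described in the Results can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' Perl implementation. The function names, the `top_k` cutoff, and the `min_rho` threshold used to decide which genes count as "coexpressed" with a TF are all assumptions; the abstract does not state how coexpressed genes are selected. Spearman rank correlation is used because the highlights mention it.

```python
from itertools import combinations


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def build_sccm(expr, tfs, top_k=3, min_rho=0.8):
    """Build a sparse, symmetric shared-coexpression matrix.

    expr: dict mapping gene name -> list of expression values.
    tfs:  list of gene names that are transcription factors.
    top_k, min_rho: hypothetical selection rule (an assumption):
    a TF's coexpressed genes are its top_k partners with rho >= min_rho.
    """
    partners = {}
    for tf in tfs:
        scored = [(spearman(expr[tf], expr[g]), g) for g in expr if g != tf]
        scored = [(rho, g) for rho, g in scored if rho >= min_rho]
        scored.sort(reverse=True)
        partners[tf] = {g for _, g in scored[:top_k]}

    sccm = {tf: {} for tf in tfs}
    for a, b in combinations(tfs, 2):
        shared = len(partners[a] & partners[b])
        if shared:  # store only non-zero entries, keeping the matrix sparse
            sccm[a][b] = shared
            sccm[b][a] = shared
    return sccm
```

Each entry `sccm[a][b]` counts the genes that both TFs are coexpressed with, so two TFs are linked through their shared neighbours rather than through their direct correlation.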
Highlights
Identifying the key transcription factors (TFs) controlling a biological process is the first step toward a better understanding of underpinning regulatory mechanisms
We developed a novel approach for identifying TFs involved in a biological process by building a conceptually new coexpression network, represented by a shared coexpression connectivity matrix (SCCM), and decomposing it into multiple subnetworks using Triple-Link, a heuristic algorithm that works as follows: it first searches all connected node pairs in the SCCM and identifies the pair with the highest connectivity, which is then used as a primer for growing a TF cluster
Using the pipeline, which incorporates Spearman rank correlation, coexpression analysis was applied to both the human and the Arabidopsis data sets, and a separate SCCM was built for each species
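The primer-and-grow idea in the Triple-Link highlight above can be sketched as a greedy decomposition. This is a simplified stand-in, not the published algorithm: the text does not spell out the linking criteria, so the rule used here (a candidate joins when it is connected to at least `min_links` current cluster members with at least `min_shared` shared genes) and both parameter names are assumptions made for illustration.

```python
def grow_clusters(sccm, min_shared=1, min_links=2):
    """Greedily split an SCCM into TF clusters.

    sccm: dict TF -> dict TF -> shared-gene count (symmetric, sparse).
    min_shared, min_links: hypothetical linking thresholds (assumptions).
    """
    remaining = set(sccm)
    clusters = []
    while True:
        # Step 1: find the remaining pair with the highest connectivity.
        best = None
        for a in sorted(remaining):
            for b in sorted(sccm[a]):
                w = sccm[a][b]
                if b in remaining and a < b and w >= min_shared:
                    if best is None or w > best[0]:
                        best = (w, a, b)
        if best is None:
            break  # no connected pair is left

        # Step 2: use that pair as the primer and grow the cluster.
        _, a, b = best
        cluster = {a, b}
        grew = True
        while grew:
            grew = False
            for cand in sorted(remaining - cluster):
                links = sum(
                    1 for member in cluster
                    if sccm[cand].get(member, 0) >= min_shared
                )
                if links >= min_links:
                    cluster.add(cand)
                    grew = True

        # Step 3: remove the finished cluster and repeat on what is left.
        clusters.append(sorted(cluster))
        remaining -= cluster
    return clusters
```

With this rule, TFs that are attached to a cluster by only a single link are left out and may seed or join a later cluster, which keeps each cluster tightly connected.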
Summary
Identifying the key transcription factors (TFs) controlling a biological process is the first step toward a better understanding of the underlying regulatory mechanisms. Because gene regulatory networks involve a large number of genes and complex interactions, identifying the TFs involved in a biological process remains difficult. Methods for identifying TFs involved in biological processes are still very scarce, and novel, more powerful methods are urgently needed. Given that microarray data measure only a small component of the interacting variables in a genetic regulatory network [9], and that some of the nonlinear relationships between TFs and their targets are difficult to simulate and predict [10,11], using TF-target modeling to identify a short list of crucial TFs controlling biological processes in either mammals or plants is inefficient. As prior knowledge of target genes often does not exist, there is a need to develop new approaches for recognizing a short list of TFs controlling a biological process.