Modularity-based credible prediction of disease genes and detection of disease subtypes on the phenotype-gene heterogeneous network

Shao Li, Xin Yao, Han Hao, Yanda Li

doi:10.1186/1752-0509-5-79

Abstract

Background: Protein-protein interaction networks and phenotype similarity information have been combined to discover novel disease-causing genes. Genetic or phenotypic similarities are manifested as certain modularity properties in a phenotype-gene heterogeneous network consisting of the phenotype-phenotype similarity network, protein-protein interaction network and gene-disease association network. However, the quantitative analysis of modularity in the heterogeneous network and its influence on disease-gene discovery have not yet been addressed. Furthermore, the genetic correspondence of disease subtypes can be identified by marking the genes and phenotypes in the phenotype-gene network. We present a novel network inference method to measure network modularity and, in particular, to suggest disease subtypes based on the heterogeneous network.

Results: Based on a measure introduced to evaluate the closeness between two nodes in the phenotype-gene heterogeneous network, we developed a Hitting-Time-based method, CIPHER-HIT, for assessing the modularity of disease gene predictions, credibly prioritizing disease-causing genes, and then identifying the genetic modules corresponding to potential subtypes of the queried phenotype. CIPHER-HIT does not rely on any preset parameters. We found that, when modularity levels are taken into account, CIPHER-HIT significantly improves the performance of disease gene predictions, which demonstrates that modularity is one of the key features for credible inference of disease genes on the phenotype-gene heterogeneous network. By applying CIPHER-HIT to the subtype analysis of breast cancer, we found that the prioritized genes can be divided into two sub-modules: one contains members of the Fanconi anemia gene family, and the other contains a reported protein complex, MRE11/RAD50/NBN.

Conclusions: The phenotype-gene heterogeneous network contains abundant information not only for disease gene discovery but also for disease subtype detection. The CIPHER-HIT method presented here is effective for network inference, particularly for credible prediction of disease genes and for the subtype analysis of diseases such as breast cancer. This method provides a promising way to analyze heterogeneous biological networks, both globally and locally.

Highlights

Protein-protein interaction networks and phenotype similarity information have been synthesized together to discover novel disease-causing genes
CIPHER-HIT: the topological closeness measure based on the Mean-Hitting-Time. The CIPHER method [2] and the random walk with restart (RWR) method [3,4] both reflect the global structural information of the phenotype-gene heterogeneous network, but performance-related parameters, such as the restart rate in RWR, must be pre-set
In the CIPHER-HIT method, we present a new closeness measure between two nodes based on the Mean-Hitting-Time of the random walk on the heterogeneous network
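
As a rough illustration of the quantity behind this closeness measure, the mean hitting time of a simple random walk can be obtained by solving a linear system. The sketch below is not the CIPHER-HIT measure itself: the function name `mean_hitting_times`, the toy adjacency matrix, and the choice of an edge-weight-proportional walk are hypothetical choices made for illustration, and the graph is assumed to be connected.

```python
import numpy as np

def mean_hitting_times(adj, target):
    """Expected number of steps for a random walk to first reach `target`.

    Assumption (not from the source): the walk moves to a neighbour with
    probability proportional to the edge weight, and the graph is connected.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    degree = adj.sum(axis=1)
    if np.any(degree == 0):
        raise ValueError("every node needs at least one edge")
    # Row-normalise the adjacency matrix into transition probabilities.
    transition = adj / degree[:, None]
    # For i != target: h(i) = 1 + sum_k P[i, k] * h(k), with h(target) = 0.
    others = [i for i in range(n) if i != target]
    q = transition[np.ix_(others, others)]
    h_others = np.linalg.solve(np.eye(n - 1) - q, np.ones(n - 1))
    h = np.zeros(n)
    h[others] = h_others
    return h

# Hypothetical toy network: nodes 0-1 are phenotypes, nodes 2-4 are genes.
toy = np.zeros((5, 5))
for a, b in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    toy[a, b] = toy[b, a] = 1.0

# A smaller hitting time to phenotype 0 suggests a topologically closer node.
print(mean_hitting_times(toy, target=0))
```

Unlike RWR, this computation involves no restart rate; the only inputs are the network and the target node.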

Summary

Introduction

Protein-protein interaction networks and phenotype similarity information have been combined to discover novel disease-causing genes. Network-based evidence and inference approaches have become increasingly attractive in the field of disease-causing gene discovery, and a variety of methods have been developed. The “phenotype-gene heterogeneous network” reflects a holistic view of the complex relationships among phenotypes, between phenotypes and genes, and among genes, which are captured by the phenotype-phenotype similarity network, the gene-disease association network and the protein-protein interaction network, respectively. Based on such a heterogeneous network, we proposed a regression model named CIPHER (Correlating protein Interaction network and PHEnotype network to pRedict disease genes) to quantify the concordance between candidate genes and target phenotypes [2]. Since van Driel et al. provided phenotype similarity information through text mining [17], phenotype similarity and protein-protein interactions have been combined to prioritize candidate disease genes [1,2,4].
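
The three-layer structure described above can be pictured as a single block adjacency matrix. The following is a minimal sketch under stated assumptions: the function name `build_heterogeneous_adjacency`, the matrix orientation (phenotypes first, then genes) and the toy values are hypothetical, and the source does not specify how the three layers are weighted relative to each other.

```python
import numpy as np

def build_heterogeneous_adjacency(pheno_sim, pheno_gene, ppi):
    """Stack three networks into one symmetric adjacency matrix.

    pheno_sim  : (P, P) phenotype-phenotype similarity network
    pheno_gene : (P, G) gene-disease associations (rows are phenotypes)
    ppi        : (G, G) protein-protein interaction network
    Assumption: all three layers are used as given, with no re-weighting.
    """
    pheno_sim = np.asarray(pheno_sim, dtype=float)
    pheno_gene = np.asarray(pheno_gene, dtype=float)
    ppi = np.asarray(ppi, dtype=float)
    n_pheno, n_gene = pheno_gene.shape
    if pheno_sim.shape != (n_pheno, n_pheno) or ppi.shape != (n_gene, n_gene):
        raise ValueError("layer shapes are inconsistent")
    # Phenotype nodes occupy the first P indices, gene nodes the last G.
    return np.block([[pheno_sim, pheno_gene],
                     [pheno_gene.T, ppi]])

# Hypothetical toy layers: 2 phenotypes and 3 genes.
pheno_sim = np.array([[0.0, 0.6],
                      [0.6, 0.0]])
pheno_gene = np.array([[1, 0, 0],
                       [0, 1, 0]])
ppi = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])

network = build_heterogeneous_adjacency(pheno_sim, pheno_gene, ppi)
print(network.shape)
```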

Methods

Results

Conclusion