Prediction and Validation of Gene-Disease Associations Using Methods Inspired by Social Network Analyses

U Martin Singh-Blom, Ambuj Tewari, John O Woods, Edward M Marcotte, Nagarajan Natarajan, Inderjit S Dhillon, Patrick Aloy

doi:10.1371/journal.pone.0058977

Abstract

Correctly identifying associations of genes with diseases has long been a goal in biology. With the emergence of large-scale gene-phenotype association datasets, we can leverage statistical and machine learning methods to help achieve this goal. In this paper, we present two methods for predicting gene-disease associations based on functional gene associations and gene-phenotype associations in model organisms. The first method, the Katz measure, is motivated by its success in social network link prediction, and is closely related to some of the recent methods proposed for gene-disease association inference. The second method, called Catapult (Combining dATa Across species using Positive-Unlabeled Learning Techniques), is a supervised machine learning method that uses a biased support vector machine whose features are derived from walks in a heterogeneous gene-trait network. We study the performance of the proposed methods and related state-of-the-art methods using two different evaluation strategies, on two distinct data sets, namely OMIM phenotypes and drug-target interactions. Finally, comparing the methods under these two evaluation strategies, we show that although both perform very well, the Katz measure is better at identifying associations between traits and poorly studied genes, whereas Catapult is better suited to correctly identifying gene-trait associations overall.

The authors want to thank Jon Laurent and Kris McGary for some of the data used, and Li and Patra for making their code available. Most of Ambuj Tewari's contribution to this work happened while he was a postdoctoral fellow at the University of Texas at Austin.

Highlights

Predicting new gene-disease associations has long been an important goal in computational biology
One of the most commonly used kinds of association is derived from direct protein-protein interactions, such as the ones curated by the Human Protein Reference Database (HPRD) [4]
Gene-disease association data can be thought of as a bipartite graph, where each gene and each disease is a node, and there is an edge between a gene node and a disease node if there is a known association between the gene and the disease
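The bipartite view in the last highlight can be sketched in a few lines of Python. This is only an illustration: the gene and disease labels are hypothetical placeholders, not entries from OMIM or any dataset used in the paper, and the helper names `build_bipartite` and `has_edge` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy associations as (gene, disease) pairs.
# These labels are placeholders, not real data from the paper.
associations = [
    ("GENE_A", "disease_1"),
    ("GENE_B", "disease_1"),
    ("GENE_B", "disease_2"),
]


def build_bipartite(pairs):
    """Build both sides of a bipartite graph from (gene, disease) pairs.

    Edges only ever connect a gene node to a disease node, so two
    adjacency maps (one per side) describe the whole graph.
    """
    gene_to_diseases = defaultdict(set)
    disease_to_genes = defaultdict(set)
    for gene, disease in pairs:
        gene_to_diseases[gene].add(disease)
        disease_to_genes[disease].add(gene)
    return dict(gene_to_diseases), dict(disease_to_genes)


gene_to_diseases, disease_to_genes = build_bipartite(associations)


def has_edge(gene, disease):
    """Return True if a known association links the gene and the disease."""
    return disease in gene_to_diseases.get(gene, set())
```

A missing edge here means only that no association is known, not that none exists, which is why the abstract frames the prediction task as positive-unlabeled learning.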

Summary

Introduction

Predicting new gene-disease associations has long been an important goal in computational biology. One kind of network that has proven useful for predicting biological function is the functional interaction network, in which a pair of genes is connected based on integrated evidence from a wide array of information sources, as in the networks of Lee et al. [9]. Such networks have been used to associate genes with phenotypes in model organisms [10,11] and in humans [12,13]. Since functional gene interaction networks aggregate many different types of information, they can achieve much greater coverage than pure protein-protein interaction networks.
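To illustrate how a gene network can carry information toward a disease node, the following is a minimal sketch of a truncated Katz score, using the standard link-prediction definition: the score of a node pair is the sum over walk lengths l of beta^l times the number of walks of length l between them. The toy network, the damping value `beta`, the cutoff `max_len`, and the function names are all assumptions for this example; the paper's actual network construction and parameter choices are not given in this section.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def truncated_katz(A, beta, max_len):
    """Sum beta**l * (A**l) for l = 1..max_len.

    A is a symmetric adjacency matrix; entry (i, j) of A**l counts the
    walks of length l between nodes i and j.
    """
    n = len(A)
    scores = [[0.0] * n for _ in range(n)]
    # Start from the identity matrix (A**0) and multiply by A each step.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for length in range(1, max_len + 1):
        power = matmul(power, A)
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
    return scores


# Hypothetical toy network: nodes 0-2 are genes, node 3 is a disease.
# Gene-gene edges: 0-1 and 1-2. Known gene-disease edge: 2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

scores = truncated_katz(A, beta=0.1, max_len=3)
disease = 3
# Rank the genes by their score against the disease node.
ranking = sorted(range(3), key=lambda g: scores[g][disease], reverse=True)
```

In this toy case the gene already linked to the disease ranks first, followed by genes at increasing distance in the gene network, because longer walks are damped by higher powers of `beta`.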

Methods

Results

Conclusion