Abstract

Background: Identifying disease genes from a list of candidate genes is an important task in bioinformatics. The main strategy is to prioritize candidate genes based on their similarity to known disease genes. Most existing gene prioritization methods access only one genomic data source, which is noisy and incomplete. Thus, there is a need to integrate multiple data sources containing different information.

Results: In this paper, we propose a combination strategy called the discounted rating system (DRS). We performed leave-one-out cross-validation to compare it with the N-dimensional order statistics (NDOS) used in Endeavour. The AUC (area under the curve) values achieved by DRS were comparable to those of NDOS on most of the disease families, but DRS ran much faster than NDOS, especially as the number of data sources increased. With 100 candidate genes and 20 data sources, DRS ran more than 180 times faster than NDOS. Within the DRS framework, different data sources can be given different weights. The weighted DRS achieved significantly higher AUC values than NDOS.

Conclusion: The proposed DRS algorithm is a powerful and effective framework for candidate gene prioritization. If the weights of the different data sources are chosen properly, the DRS algorithm performs better.

Highlights

  • Identifying disease genes from a list of candidate genes is an important task in bioinformatics

  • The Methods section first introduces the data used in this work: disease genes, protein-protein interaction (PPI) data and Gene Ontology (GO) [15]

  • The protein-protein interaction (PPI) data were represented as a PPI network, and the random walk with restart (RWR) algorithm was applied directly to that network
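
As a rough illustration of how a random walk with restart can score candidate genes on a PPI network, the sketch below iterates the standard RWR update until the visiting probabilities stabilize. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `random_walk_with_restart`, the restart probability of 0.3, the unweighted degree-normalized transitions, and the four-gene toy network are all hypothetical choices that the text above does not specify.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    adjacency: dict mapping each node to the set of its neighbours
               (assumed undirected and unweighted).
    seeds:     known disease genes; the walker restarts uniformly among them.
    restart:   probability of jumping back to a seed at each step (assumed value).
    """
    if not seeds or any(s not in adjacency for s in seeds):
        raise ValueError("seeds must be a non-empty subset of the network nodes")

    # Initial distribution: all probability mass spread evenly over the seeds.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adjacency}
    p = dict(p0)

    for _ in range(max_iter):
        # Restart term: a fraction `restart` of the mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in adjacency}
        for n, neighbours in adjacency.items():
            walk_mass = (1.0 - restart) * p[n]
            if not neighbours:
                # An isolated node keeps its walking mass so the total stays 1.
                nxt[n] += walk_mass
                continue
            # Walking term: split the remaining mass evenly among neighbours.
            for m in neighbours:
                nxt[m] += walk_mass / len(neighbours)
        change = sum(abs(nxt[n] - p[n]) for n in adjacency)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: a chain A - B - C - D with A as the known disease gene.
ppi = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(ppi, seeds={"A"})

# Rank the candidate genes by their proximity score to the seed.
candidates = ["B", "C", "D"]
ranking = sorted(candidates, key=scores.get, reverse=True)
print(ranking)
```

In this toy chain, candidates closer to the seed gene receive higher scores, which is the intuition behind using network proximity as one evidence source for prioritization.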


Summary

Introduction

Identifying disease genes from a list of candidate genes is an important task in bioinformatics. Most existing gene prioritization methods access only one genomic data source, which is noisy and incomplete, so there is a need to integrate multiple data sources containing different information. Bioinformatics research has a pertinent role in analyzing biological data for disease gene discovery. Most current efforts at disease-gene identification, which involve linkage analysis and association studies, result in a genomic interval of 0.5-10 centimorgans (cM) containing up to 300 genes [1,2]. These candidate genes need to be investigated further to identify the disease-causing genes.

