ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

Fantine Mordelet, Jean-Philippe Vert

doi:10.1186/1471-2105-12-389

Abstract

Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases.

Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.

Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

Highlights

Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology
As a gold standard we extracted all known disease-gene associations from the OMIM database [25], and we borrowed from [7] nine sources of information about the genes, including expression profiles in various experiments, functional annotations, known protein-protein interactions (PPI), transcriptional motifs, protein domain activity and literature data
We compare two ways to perform data integration: first by averaging the nine kernel functions, and second by letting ProDiGe itself optimize the relative contribution of each source of information when the model is estimated, through a multiple kernel learning (MKL) approach. We compare both variants with the best model of [10], namely the MKL1class model, which in this case differs from ProDiGe only in the machine learning paradigm implemented: while ProDiGe learns a model from positive and unlabeled examples, MKL1class learns it only from positive examples
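The two ideas in the highlights above, averaging several kernels and scoring genes from positives alone versus from positives and unlabeled genes, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy 4-gene kernels, the names `average_kernels`, `one_class_scores` and `pu_scores`, and the mean-similarity scoring rule, which is a deliberately simple stand-in and not ProDiGe's or MKL1class's actual learning algorithm.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def average_kernels(kernels: Sequence[Matrix]) -> Matrix:
    """Element-wise mean of equally sized square kernel (similarity) matrices."""
    n = len(kernels[0])
    if any(len(k) != n or any(len(row) != n for row in k) for k in kernels):
        raise ValueError("all kernels must be n x n over the same genes")
    m = len(kernels)
    return [[sum(k[i][j] for k in kernels) / m for j in range(n)] for i in range(n)]


def one_class_scores(K: Matrix, positives: Sequence[int]) -> Dict[int, float]:
    """Score each candidate by its mean similarity to known disease genes only."""
    pos = set(positives)
    return {
        g: sum(K[g][p] for p in pos) / len(pos)
        for g in range(len(K))
        if g not in pos
    }


def pu_scores(K: Matrix, positives: Sequence[int]) -> Dict[int, float]:
    """Score each candidate by mean similarity to positives minus mean
    similarity to the other unlabeled genes (illustrative rule, not the paper's)."""
    pos = set(positives)
    unlabeled = [g for g in range(len(K)) if g not in pos]
    scores = {}
    for g in unlabeled:
        others = [u for u in unlabeled if u != g]
        to_pos = sum(K[g][p] for p in pos) / len(pos)
        to_unl = sum(K[g][u] for u in others) / len(others) if others else 0.0
        scores[g] = to_pos - to_unl
    return scores


# Hypothetical toy kernels over 4 genes (e.g. one from expression, one from PPI).
K_expr = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.4, 0.2],
    [0.6, 0.4, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
K_ppi = [
    [1.0, 0.6, 0.8, 0.3],
    [0.6, 1.0, 0.6, 0.0],
    [0.8, 0.6, 1.0, 0.1],
    [0.3, 0.0, 0.1, 1.0],
]

K_avg = average_kernels([K_expr, K_ppi])
known_disease_genes = [0, 1]  # assumed positives; genes 2 and 3 are unlabeled

oc = one_class_scores(K_avg, known_disease_genes)
pu = pu_scores(K_avg, known_disease_genes)
ranked = sorted(pu, key=pu.get, reverse=True)  # best candidate first
print(ranked)
```

In this sketch the positive-and-unlabeled score penalizes a candidate that resembles the unlabeled background as much as it resembles the known disease genes, whereas the one-class score ignores the background entirely. An MKL variant would learn per-kernel weights instead of the uniform weights used by `average_kernels`.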

Summary

Introduction

Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Considerable efforts have been made to elucidate the genetic basis of rare and common human diseases. Traditional approaches to discover disease genes first identify chromosomal regions likely to contain the gene of interest, e.g., by linkage analysis or study of chromosomal aberrations in DNA samples from large case-control populations. The regions identified often contain tens to hundreds of candidate genes. Finding the causal gene(s) among these candidates is an expensive and time-consuming process, which requires extensive laboratory experiments. Progress in sequencing, microarray or proteomics technologies has facilitated the


Journal: BMC Bioinformatics	Publication Date: Oct 6, 2011
Citations: 183	License type: CC BY 2.0
