Abstract

Background

Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization.

Results

We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance.

Conclusion

We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-315) contains supplementary material, which is available to authorized users.

Highlights

  • Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear

  • We implemented and validated a random-set scoring model for a network-based gene prioritization approach. This approach uses biomedical records (e.g., OMIM, PubMed, and GeneRIF) as the phenotypic profile of candidate genes to infer their association with diseases

  • The candidate gene is prioritized as a gene complex based on the physical and functional protein-protein interaction (PPI) network from STRING [12,13,14,15,16,17,18]


Summary

Introduction

Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. Identifying the causative variant(s) is still a daunting task, as the mechanisms through which the variants influence disease or quantitative phenotypes are often unclear [...] pathways. This means that, once a gene complex with members involved in one disease has been identified, the other members of the complex become candidates for having a biological relationship with the same disease. A global examination of biological textual data will establish the phenotypic profile of genes with respect to their connection to disease biology. The phenotypic profile of a gene, referred to as the gene-associated phenotype, can be obtained by large-scale text-mining of biomedical records using information extraction and retrieval techniques, and by filtering the biomedical terms with specific vocabularies such as that from the Unified Medical Language System (UMLS) [7].
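The two ideas in this paragraph, a vocabulary-filtered phenotypic profile per gene and a score for a whole gene complex, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy relevance measure (fraction of a gene's filtered terms shared with the disease), and the small term sets standing in for a UMLS-style vocabulary are all assumptions made for illustration. The complex score uses the standard random-set z-score, which compares the mean score of the complex with the mean expected for a random gene set of the same size drawn without replacement; whether the paper's model takes exactly this form is not stated in the text above.

```python
import math


def phenotype_relevance(gene_terms, disease_terms, vocabulary):
    """Toy relevance score (hypothetical): keep only terms found in the
    controlled vocabulary, then return the fraction of the kept terms
    that also describe the disease."""
    filtered = {t for t in gene_terms if t in vocabulary}
    if not filtered:
        return 0.0
    return len(filtered & set(disease_terms)) / len(filtered)


def random_set_z(scores, complex_genes):
    """Random-set z-score of a gene complex.

    scores:        dict mapping every gene in the background to a score
    complex_genes: iterable of genes forming the complex (all in scores)

    Under random sampling of m genes without replacement from G genes,
    the set mean has expectation mu and variance
    (sigma^2 / m) * (G - m) / (G - 1), with sigma^2 the population variance.
    """
    members = set(complex_genes)
    G, m = len(scores), len(members)
    if not 0 < m < G:
        raise ValueError("complex must be a non-empty proper subset of the background")
    mu = sum(scores.values()) / G
    sigma2 = sum((s - mu) ** 2 for s in scores.values()) / G
    variance = sigma2 / m * (G - m) / (G - 1)
    if variance == 0:
        return 0.0  # all genes score the same; no set can stand out
    set_mean = sum(scores[g] for g in members) / m
    return (set_mean - mu) / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical toy data: gene names and terms are placeholders.
    vocabulary = {"asthma", "wheezing", "eczema"}
    disease = {"asthma", "eczema"}
    profiles = {
        "GENE_A": ["asthma", "eczema", "binding"],
        "GENE_B": ["asthma", "wheezing"],
        "GENE_C": ["wheezing", "kinase"],
        "GENE_D": ["transport"],
    }
    scores = {g: phenotype_relevance(t, disease, vocabulary) for g, t in profiles.items()}
    print(random_set_z(scores, ["GENE_A", "GENE_B"]))
```

In this sketch a complex ranks highly when its members' profiles are, on average, more disease-relevant than an equally sized random draw from the background, which mirrors the reasoning that the remaining members of a disease-linked complex become candidates themselves.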

