Abstract

Background: Pinpointing genes involved in inherited human diseases remains a great challenge in the post-genomics era. Although approaches have been proposed either based on the guilt-by-association principle or making use of disease phenotype similarities, the low coverage of both diseases and genes in existing methods has been preventing the scan of causative genes for a significant proportion of diseases at the whole-genome level.

Results: To overcome this limitation, we proposed a rigorous statistical method called pgFusion to prioritize candidate genes by integrating one type of disease phenotype similarity derived from the Unified Medical Language System (UMLS) and seven types of gene functional similarities calculated from gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern, respectively. Our method covered a total of 7,719 diseases and 20,327 genes, achieving the highest coverage thus far for both diseases and genes. We performed leave-one-out cross-validation experiments to demonstrate the superior performance of our method and applied it to a real exome sequencing dataset of epileptic encephalopathies, showing the capability of this approach in finding causative genes for complex diseases. We further provided the standalone software and online services of pgFusion at http://bioinfo.au.tsinghua.edu.cn/jianglab/pgfusion.

Conclusions: pgFusion not only provided an effective way for prioritizing candidate genes, but also demonstrated feasible solutions to two fundamental questions in the analysis of big genomic data: the comparability of heterogeneous data and the integration of multiple types of data. Applications of this method in exome or whole genome sequencing studies would accelerate the finding of causative genes for human diseases. Other research fields in genomics could also benefit from the incorporation of our data fusion methodology.

Highlights

  • Pinpointing genes involved in inherited human diseases remains a great challenge in the postgenomics era

  • Candidate genes can be ranked according to their functional similarity to a set of seed genes that are known to be associated with the query disease

  • In existing studies belonging to this category, such similarities have been quantified based on gene expression [6], gene ontology [7], protein sequences [8], protein-protein interactions [9], and many others [10,11,12]


Summary

Introduction

Pinpointing genes involved in inherited human diseases remains a great challenge in the postgenomics era. Although approaches have been proposed based either on the guilt-by-association principle or on disease phenotype similarities, the low coverage of both diseases and genes in existing methods has prevented the scanning of causative genes for a significant proportion of diseases at the whole-genome level. Via genome-wide association (GWA) studies, genetic factors related to a query disease can typically be located within a region of 10M base pairs, containing [...]. To address these demands, two groups of computational approaches have been proposed for the prioritization of candidate genes. The requirement of a predefined set of seed genes may greatly restrict the scope of application of these methods: according to the OMIM (Online Mendelian Inheritance in Man) database [14], the genetic bases of a significant proportion of human diseases are completely unknown, which makes the selection of seed genes for such diseases a problem.
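To make the seed-gene idea concrete, the following is a minimal sketch of guilt-by-association ranking, in which each candidate is scored by its mean functional similarity to the seed genes. It is not the pgFusion method. The function name `rank_candidates`, the gene labels, and the similarity values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Pairwise similarities keyed by an unordered gene pair.
Similarity = Dict[FrozenSet[str], float]


def rank_candidates(
    similarity: Similarity,
    seeds: List[str],
    candidates: List[str],
) -> List[Tuple[str, float]]:
    """Rank candidates by mean similarity to the seed genes (illustrative only).

    Missing pairs are treated as similarity 0.0, which is an assumption of
    this sketch rather than something stated in the source.
    """
    scored = []
    for gene in candidates:
        sims = [
            similarity.get(frozenset((gene, seed)), 0.0)
            for seed in seeds
            if seed != gene
        ]
        score = sum(sims) / len(sims) if sims else 0.0
        scored.append((gene, score))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: two seed genes and three candidates.
toy_similarity: Similarity = {
    frozenset(("C1", "S1")): 0.9,
    frozenset(("C1", "S2")): 0.7,
    frozenset(("C2", "S1")): 0.2,
    frozenset(("C3", "S2")): 0.5,
}

ranked = rank_candidates(toy_similarity, ["S1", "S2"], ["C1", "C2", "C3"])
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

With an empty seed list every candidate receives a score of 0.0, which illustrates the limitation noted above: without known seed genes, this kind of ranking carries no information.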

