Rational confederation of genes and diseases: NGS interpretation via GeneCards, MalaCards and VarElect

Noa Rappaport, Simon Fishilevich, Frida Belinky, Tsippi Iny Stein, Ron Nudel, Dana Cohen, Michal Twik, Inbar Plaschkes, Marilyn Safran, Danit Oz-Levi, Doron Lancet

doi:10.1186/s12938-017-0359-2

Abstract

Background: A key challenge in the realm of human disease research is next generation sequencing (NGS) interpretation, whereby identified filtered variant-harboring genes are associated with a patient’s disease phenotypes. This necessitates bioinformatics tools linked to comprehensive knowledgebases. The GeneCards suite databases, which include GeneCards (human genes), MalaCards (human diseases) and PathCards (human pathways) together with additional tools, are presented with the focus on MalaCards utility for NGS interpretation as well as for large scale bioinformatic analyses.

Results: VarElect, our NGS interpretation tool, leverages the broad information in the GeneCards suite databases. MalaCards algorithms unify disease-related terms and annotations from 69 sources. Further, MalaCards defines hierarchical relatedness: aliases, disease families, a related diseases network, categories and ontological classifications. GeneCards and MalaCards delineate and share a multi-tiered, scored gene-disease network with stringency levels. This includes the definition of elite status (high-quality gene-disease pairs coming from manually curated, trustworthy sources), which covers 4500 genes for 8000 diseases. This unique resource is key to NGS interpretation by VarElect. VarElect, a comprehensive search tool that helps infer both direct and indirect links between genes and user-supplied disease/phenotype terms, is robustly strengthened by the information found in MalaCards. The indirect mode benefits from GeneCards’ diverse gene-to-gene relationships, including SuperPaths, which are integrated biological pathways from 12 information sources. We are currently adding an important information layer in the form of “disease SuperPaths”, generated from the gene-disease matrix by an algorithm similar to that previously employed for biological pathway unification. This allows the discovery of novel gene-disease and disease–disease relationships.

The advent of whole genome sequencing necessitates capacities to go beyond protein coding genes. GeneCards is highly useful in this respect, as it also addresses 101,976 non-protein-coding RNA genes. In a more recent development, we are currently adding an inclusive map of regulatory elements and their inferred target genes, generated by integration from 4 resources.

Conclusions: MalaCards provides a rich big-data scaffold for in silico biomedical discovery within the gene-disease universe. VarElect, which depends significantly on the power of both GeneCards and MalaCards, is a potent tool for supporting the interpretation of wet-lab experiments, notably NGS analyses of disease. The GeneCards suite has thus transcended its 2-decade role in biomedical research, maturing into a key player in clinical investigation.

Highlights

A key challenge in the realm of human disease research is next generation sequencing (NGS) interpretation, whereby identified filtered variant-harboring genes are associated with a patient’s disease phenotypes
We provide an example for deciphering a specific genetic disease using MalaCards, via our VarElect bioinformatic next generation sequencing (NGS) interpretation pipeline, which utilizes several other GeneCards suite tools
The MalaCards disease universe: to help overcome the impediment of disease name unification stemming from source heterogeneity, we obtained 85,000 disease terms from 15 sources that were examined in a predefined order of importance, and used text unification heuristics to define 19,289 main names and their associated 65,000 aliases
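
The name-unification idea in the last highlight can be illustrated with a minimal sketch. It is not the MalaCards algorithm: the normalization rule, the function names (`normalize`, `unify`) and the source labels (`source_a`, `source_b`) are all assumptions made for illustration. It only shows the general pattern of walking sources in a fixed priority order, letting the first source to mention a disease supply its main name, and recording later spelling variants as aliases.

```python
import re


def normalize(term: str) -> str:
    """Hypothetical matching key: lowercase, drop possessives and punctuation."""
    key = term.lower()
    key = re.sub(r"['’]s\b", "", key)        # "parkinson's" -> "parkinson"
    key = re.sub(r"[^a-z0-9\s]", " ", key)   # hyphens, commas, etc. -> space
    return " ".join(key.split())             # collapse repeated whitespace


def unify(terms_by_source, source_priority):
    """Group terms that share a normalized key.

    Sources are visited in priority order, so the main name of each group
    comes from the highest-priority source that mentions the disease.
    """
    groups = {}
    for source in source_priority:
        for term in terms_by_source.get(source, []):
            key = normalize(term)
            if key not in groups:
                groups[key] = {"main_name": term, "aliases": []}
            elif term != groups[key]["main_name"] and term not in groups[key]["aliases"]:
                groups[key]["aliases"].append(term)
    return groups


# Toy input with made-up source labels; real sources and terms would differ.
terms_by_source = {
    "source_a": ["Parkinson Disease", "Charcot-Marie-Tooth Disease"],
    "source_b": ["Parkinson's disease", "Charcot Marie Tooth disease"],
}
groups = unify(terms_by_source, ["source_a", "source_b"])
for entry in groups.values():
    print(entry["main_name"], "|", entry["aliases"])
```

A purely string-based key like this one cannot merge true synonyms that share no wording, which is presumably why the highlight speaks of heuristics rather than a single rule.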

Summary

Introduction

A key challenge in the realm of human disease research is next generation sequencing (NGS) interpretation, whereby identified filtered variant-harboring genes are associated with a patient’s disease phenotypes. This necessitates bioinformatics tools linked to comprehensive knowledgebases. Different methods may identify such associations, including genome-wide association studies (GWAS), classical genetic studies, transcriptomics and proteomics, functional molecular studies and literature text mining [1]. Such heterogeneous datasets should be cleverly integrated to allow gene prioritization. There is a need for heuristics that connect the realm of NGS with such data structures.
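
To make the notion of gene prioritization concrete, the following toy sketch ranks candidate genes against a phenotype term using a direct gene-disease score and, failing that, an indirect score inherited through a linked gene (for example, one sharing a pathway). It is an illustration under stated assumptions, not VarElect's scoring scheme: the function `prioritize`, the `indirect_weight` down-weighting factor, the gene labels and all scores are hypothetical.

```python
def prioritize(candidates, phenotype, gene_disease, gene_links, indirect_weight=0.5):
    """Rank candidate genes by their best direct or indirect link to a phenotype.

    gene_disease: {gene: {phenotype_term: association score}}
    gene_links:   {gene: [related genes]}, e.g. shared-pathway partners
    indirect_weight: assumed factor that down-weights indirect evidence
    """
    ranked = []
    for gene in candidates:
        direct = gene_disease.get(gene, {}).get(phenotype, 0.0)
        # Best score among linked genes, discounted because the evidence is indirect.
        best_neighbor = max(
            (gene_disease.get(n, {}).get(phenotype, 0.0) for n in gene_links.get(gene, ())),
            default=0.0,
        )
        weighted = indirect_weight * best_neighbor
        if direct > 0 and direct >= weighted:
            mode = "direct"
        elif weighted > 0:
            mode = "indirect"
        else:
            mode = "none"
        ranked.append((gene, max(direct, weighted), mode))
    # Highest-scoring genes first; ties keep their input order (stable sort).
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Made-up data: GENE_A is directly associated with the phenotype,
# GENE_B only through its link to GENE_C, and GENE_D not at all.
gene_disease = {"GENE_A": {"ataxia": 0.9}, "GENE_C": {"ataxia": 0.8}}
gene_links = {"GENE_B": ["GENE_C"]}
ranking = prioritize(["GENE_B", "GENE_A", "GENE_D"], "ataxia", gene_disease, gene_links)
for gene, score, mode in ranking:
    print(gene, score, mode)
```

Taking the maximum of the direct and discounted indirect scores is only one possible design choice; a real tool would need to combine many evidence types and calibrate their weights.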

Methods

Results

Conclusion