Procedure and datasets to compute links between genes and phenotypes defined by MeSH keywords

Erinija Pranckeviciene

doi:10.12688/f1000research.6140.1

Abstract

Algorithms that mine relationships between genes and phenotypes can be classified into several overlapping categories according to how a phenotype is defined: by training genes known to be related to the phenotype; by keywords; and algorithms designed to work with disease phenotypes. In this work, an algorithm for linking phenotypes to Gene Ontology (GO) annotations is outlined; it does not require training genes and is based on the algorithmic principles of the Genes to Diseases (G2D) gene prioritization tool. In the outlined algorithm, phenotypes are defined by Medical Subject Headings (MeSH) terms. GO annotations are linked to phenotypes through intermediate MeSH D terms of drugs and chemicals. This inference uses the mathematical framework of fuzzy binary relationships based on fuzzy set theory. The strength of the relationships between terms is defined through the frequency of co-occurrence of pairs of terms in PubMed articles and the frequency of association between GO annotations and MeSH D terms in the NCBI Gene gene2go and gene2pubmed datasets. Three plain tab-delimited datasets required by the algorithm are contributed to support the computations. These datasets can be imported into a relational MySQL database. MySQL statements to create the tables are provided, and a MySQL procedure implementing the computations performed by the outlined algorithm is listed. The plain tab-delimited format of the contributed tables makes it easy to use these datasets in other applications.
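As an illustration of how a phenotype term could be linked to GO annotations through intermediate MeSH D terms, the following minimal Python sketch composes two fuzzy binary relations. It rests on several assumptions that are not stated in the text above: membership values are taken as a simple ratio of co-occurrence counts, the two relations are combined by max-min composition (one common choice in fuzzy set theory, not necessarily the one used by the contributed MySQL procedure), and all term names and numbers are hypothetical toy values.

```python
def membership_from_counts(pair_count, term_count):
    """Assumed strength of a relation: the fraction of records mentioning
    a term that also mention its partner term (hypothetical definition)."""
    if term_count == 0:
        return 0.0
    return pair_count / term_count


def compose_max_min(first, second):
    """Max-min composition of two fuzzy binary relations.

    first:  {phenotype: {d_term: membership}}
    second: {d_term: {go_term: membership}}
    Returns {phenotype: {go_term: membership}}.
    """
    result = {}
    for phenotype, d_terms in first.items():
        scores = {}
        for d_term, strength_pd in d_terms.items():
            for go_term, strength_dg in second.get(d_term, {}).items():
                # A path through one D term is as strong as its weakest link.
                path = min(strength_pd, strength_dg)
                # Keep the strongest path over all intermediate D terms.
                scores[go_term] = max(scores.get(go_term, 0.0), path)
        result[phenotype] = scores
    return result


# Hypothetical toy relations; identifiers and values are made up.
pheno_to_d = {"phenotype_A": {"d_1": 0.8, "d_2": 0.3}}
d_to_go = {
    "d_1": {"go_1": 0.5, "go_2": 0.9},
    "d_2": {"go_1": 0.7},
}

scores = compose_max_min(pheno_to_d, d_to_go)
```

In this toy example, `go_2` is reached only through `d_1`, so its score is the weaker of the two links on that path, while `go_1` is reached through both D terms and keeps the stronger of the two paths.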

Highlights

Understanding molecular mechanisms underlying both normal cellular processes and disease-causing gene perturbations has numerous applications in clinical diagnostics, personal genomics and engineering[1,2,3,4,5].
Most of the genomic studies address two major questions: (i) What genomic and molecular markers are associated with an observed phenotype? (ii) What molecular mechanisms lead to that phenotype in the studied organism? Answering these questions and uncovering gene-phenotype relationships mostly relies on experimental research that has already generated very large amounts of high-throughput data stored in public databases[6,7,8,9,10].
For this reason integrative algorithms to analyze high-throughput data by mining genomic databases and literature are in the focus of intensive research resulting in many publicly available bioinformatics tools for biologists and clinical researchers[6,13,14,15,16,17,18,19].

Summary

Introduction

Understanding the molecular mechanisms underlying both normal cellular processes and disease-causing gene perturbations has numerous applications in clinical diagnostics, personal genomics and engineering[1,2,3,4,5]. To date, more than 1500 databases host various types of genomic and molecular biology data[11], accompanied by an increasing number of research publications analyzing newly generated data[12]. For this reason, integrative algorithms that analyze high-throughput data by mining genomic databases and literature are the focus of intensive research, which has resulted in many publicly available bioinformatics tools for biologists and clinical researchers[6,13,14,15,16,17,18,19].

Methods

Results

Discussion

Conclusion