GeneTIER: prioritization of candidate disease genes using tissue-specific gene expression profiles.

Agne Antanaviciute, Catherine Daly, David T Bonthron, Christopher M Watson, Alexander F Markham, Ian M Carr, Laura A Crinnion

doi:10.1093/bioinformatics/btv196

Abstract

Motivation: In attempts to determine the genetic causes of human disease, researchers are often faced with a large number of candidate genes. Linkage studies can point to a genomic region containing hundreds of genes, while high-throughput sequencing will often identify a great number of non-synonymous genetic variants. Since systematic experimental verification of each such candidate gene is not feasible, a method is needed to decide which genes are worth investigating further. Computational gene prioritization presents itself as a solution to this problem, systematically analyzing each gene and sorting the candidates from the most to the least likely to be the disease-causing gene, in a fraction of the time it would take a researcher to perform such queries manually.

Results: Here, we present Gene TIssue Expression Ranker (GeneTIER), a new web-based application for candidate gene prioritization. GeneTIER replaces the knowledge-based inference traditionally used in candidate disease gene prioritization applications with experimental data from tissue-specific gene expression datasets, and thus largely overcomes the bias toward better characterized genes and diseases that commonly afflicts other methods. We show that our approach is capable of accurate candidate gene prioritization and illustrate its strengths and weaknesses using case study examples.

Availability and Implementation: Freely available on the web at http://dna.leeds.ac.uk/GeneTIER/.

Contact: umaan@leeds.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Current high-throughput sequencing methods used for disease gene discovery can generate very large volumes of data
The algorithm used by GeneTIER assumes that a disease gene’s expression tends to be significantly higher in affected tissues than in unaffected tissues
The line running from the origin (0,0) to the maximum point (1,1), i.e. Y = X, which corresponds to an area under the curve (AUC) of 0.5, represents a performance that is no better than random predictions
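
The two technical highlights above can be illustrated with a minimal Python sketch. It is not GeneTIER's actual scoring scheme, which this section does not describe. It assumes a hypothetical score, the log2 ratio of mean expression in affected versus unaffected tissues, and uses made-up gene names, tissue names and expression values. The `auc` helper applies the trapezoidal rule to a list of (false positive rate, true positive rate) points and confirms that the diagonal Y = X gives an AUC of 0.5.

```python
from math import log2


def tissue_score(expression, affected, pseudocount=1.0):
    """Hypothetical score: log2 ratio of mean expression in affected
    tissues to mean expression in all other tissues.

    `expression` maps tissue name -> expression level for one gene.
    The pseudocount (an assumption) avoids division by zero for
    genes that are not expressed at all.
    """
    inside = [v for t, v in expression.items() if t in affected]
    outside = [v for t, v in expression.items() if t not in affected]
    if not inside or not outside:
        raise ValueError("need at least one affected and one unaffected tissue")
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    return log2((mean_in + pseudocount) / (mean_out + pseudocount))


def rank_genes(profiles, affected):
    """Sort gene names from most to least likely candidate under the score."""
    return sorted(
        profiles,
        key=lambda gene: tissue_score(profiles[gene], affected),
        reverse=True,
    )


def auc(points):
    """Trapezoidal area under a ROC curve.

    `points` is a list of (false_positive_rate, true_positive_rate)
    pairs ordered by increasing false positive rate.
    """
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Illustrative, invented expression profiles (arbitrary units).
profiles = {
    "GENE_A": {"retina": 90.0, "liver": 5.0, "heart": 4.0},
    "GENE_B": {"retina": 10.0, "liver": 12.0, "heart": 11.0},
    "GENE_C": {"retina": 2.0, "liver": 60.0, "heart": 55.0},
}

ranking = rank_genes(profiles, affected={"retina"})
diagonal_auc = auc([(0.0, 0.0), (1.0, 1.0)])  # random-guess baseline
perfect_auc = auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])  # ideal classifier
```

With these invented profiles, the gene expressed mainly in the affected tissue (`GENE_A`) is ranked first, and a ranking method is only useful to the extent that its ROC curve lies above the diagonal baseline.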

Summary

Introduction

Current high-throughput sequencing methods used for disease gene discovery can generate very large volumes of data. A common approach is to examine biological databases and literature for information pertaining to each candidate disease gene, in order to select the most promising genes. This can be both slow and error-prone, as the data are spread across multiple resources with no common structure. Nor can this type of analysis be quantified, since the selection is based solely on the subjective impressions of the researcher.

Methods

Results

Conclusion