Abstract

Protein functional similarity based on gene ontology (GO) annotations serves as a powerful tool for comparing proteins on a functional level in applications such as protein-protein interaction prediction, gene prioritization, and disease gene discovery. Functional similarity (FS) is usually quantified by combining the GO hierarchy with an annotation corpus that links genes and gene products to GO terms. One large group of algorithms calculates the GO term semantic similarity (SS) between all the terms annotating the two proteins, followed by a second step, described as a “mixing strategy”, which combines the SS values to yield the final FS value. Due to the variability of protein annotation caused, for example, by annotation bias, this value cannot be reliably compared on an absolute scale. We therefore introduce a similarity z-score that takes into account the FS background distribution of each protein. For a selection of popular SS measures and mixing strategies, we demonstrate a moderate accuracy improvement when using z-scores in a benchmark that aims to separate orthologous cases from random gene pairs and, in this context, discuss the impact of the choice of annotation corpus. The approach has been implemented in Frela, a fast high-throughput public web server for protein FS calculation and interpretation.
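The idea of a protein-specific z-score can be sketched as follows. This is a minimal illustration, not the formula used in Frela: the function names, the toy background values, and in particular the rule for combining the two per-protein z-scores (a plain average) are assumptions made for this example.

```python
from statistics import mean, pstdev


def z_score(fs_value, background):
    """Standardize an FS value against one protein's background FS values."""
    mu = mean(background)
    sigma = pstdev(background)  # population standard deviation
    if sigma == 0:
        raise ValueError("background FS values have zero variance")
    return (fs_value - mu) / sigma


def pair_z_score(fs_value, background_a, background_b):
    """Assumed combination rule: average of the two per-protein z-scores."""
    return (z_score(fs_value, background_a) + z_score(fs_value, background_b)) / 2


# Hypothetical backgrounds: FS of each protein against a sample of other proteins.
well_annotated = [0.5, 0.6, 0.7, 0.6]      # tends to score high against anything
sparsely_annotated = [0.1, 0.2, 0.3, 0.2]  # tends to score low against anything

raw_fs = 0.6
print(z_score(raw_fs, well_annotated))      # close to the background mean
print(z_score(raw_fs, sparsely_annotated))  # far above the background mean
```

The same raw FS value of 0.6 is unremarkable for the first protein but stands out for the second, which is the effect an absolute FS scale would hide.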

Highlights

  • Gene products can be compared in many different ways; for example, researchers have compared proteins based on amino acid similarity[1] for many years

  • We compared the performance of selected functional similarity (FS) measures in discriminating pairs of orthologous genes from random pairs and used protein-based z-scores as a means to improve the discriminatory power of these FS measures

  • We have chosen the following six pairwise semantic similarity (SS) measures: Resnik, Lin, Schlicker, information coefficient, Jiang and Conrath, and graph information content. We combined these SS measures with each of the following five mixing strategies: average, maximum, maximum of best matches, best matches averaged, and mean of best matches, which gives a total of 30 FS measures that were investigated in our study
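A mixing strategy reduces the matrix of pairwise SS values (rows: GO terms of the first protein, columns: GO terms of the second) to a single FS value. The sketch below shows three common choices. The function names and the toy matrix are assumptions for illustration, and the best-match average shown here follows the widely used definition, which may not correspond exactly to any one of the best-match variants named above.

```python
def mix_average(matrix):
    """Mean over all pairwise SS values."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)


def mix_maximum(matrix):
    """Largest pairwise SS value."""
    return max(v for row in matrix for v in row)


def mix_best_match_average(matrix):
    """Average of the mean row-wise and mean column-wise best matches."""
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2


# Hypothetical SS matrix: two GO terms per protein.
ss = [
    [1.0, 0.0],
    [0.0, 0.5],
]

print(mix_average(ss))             # 0.375
print(mix_maximum(ss))             # 1.0
print(mix_best_match_average(ss))  # 0.75
```

The average is pulled down by unrelated term pairs, the maximum ignores everything except the single best pair, and the best-match average sits in between by giving each term its best partner.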


Summary

Exploring Approaches for Detecting

Several reviews are available[6,7,8,9] that attempt to classify the different methods into groups with related SS measures. These measures make use of the structure of the GO directed acyclic graph (DAG) and/or combine it with information from a GO annotation corpus, which provides the mapping between GO terms and gene products. Protein annotation is biased and influenced by differing research interests: model organisms of human disease, for example, are better annotated[17], and promising gene products (e.g. disease-associated genes) or specific gene families have a higher number of annotations. These biases have been analysed over time[18] and lead to correlations involving the number of GO terms a protein is annotated with, which in turn affect applications that involve SS measures[19]. The software, including the source code for the web server, is available for download from our web server.
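To illustrate how an SS measure can combine the DAG structure with an annotation corpus, the sketch below computes Resnik similarity, defined as the information content (IC) of the most informative common ancestor of two terms. The ontology, term names, and corpus are hypothetical toy data, not real GO terms or annotations.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a small DAG).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis", "binding"},
}

# Hypothetical annotation corpus: gene product -> directly annotated terms.
CORPUS = {
    "P1": {"dna_binding"},
    "P2": {"kinase"},
    "P3": {"binding"},
    "P4": {"catalysis"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t); p(t) is the share of gene products annotated
    with t or with any descendant of t."""
    hits = sum(
        1 for terms in CORPUS.values()
        if any(term in ancestors(t) for t in terms)
    )
    if hits == 0:
        raise ValueError(f"no annotations for term {term!r}")
    return -math.log(hits / len(CORPUS))


def resnik(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


print(resnik("dna_binding", "kinase"))     # shared ancestor: binding
print(resnik("dna_binding", "catalysis"))  # only the root is shared
```

Because IC is estimated from the corpus, terms used by heavily studied gene products shift the probabilities, which is one route by which annotation bias reaches the SS values.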
