Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method

Xiaomei Wu,Kui Lin,Zhen-Ming Pei,Erli Pang,Peter Csermely

doi:10.1371/journal.pone.0066745


Open Access

https://doi.org/10.1371/journal.pone.0066745


Abstract

Background: Explicit comparisons based on the semantic similarity of Gene Ontology (GO) terms provide a quantitative way to measure the functional similarity between gene products and are widely applied in large-scale genomic research via integration with other models. Previously, we presented an edge-based method, Relative Specificity Similarity (RSS), which takes the global position of relevant terms into account. However, edge-based semantic similarity metrics are sensitive to the intrinsic structure of GO and simply consider terms at the same level in the ontology to be equally specific nodes, revealing weaknesses that could be complemented using information content (IC).

Results and Conclusions: Here, we used IC-based nodes to improve RSS and proposed a new method, Hybrid Relative Specificity Similarity (HRSS). HRSS outperformed other methods in distinguishing true protein-protein interactions from false ones. HRSS values were divided into four different levels of confidence for protein interactions. In addition, HRSS was statistically the best at obtaining the highest average functional similarity among human-mouse orthologs. Both HRSS and the groupwise measure, simGIC, are superior in correlation with sequence and Pfam similarities. Because different measures are best suited for different circumstances, we compared two pairwise strategies, the maximum and the best-match average, in the evaluation. The former was more effective at inferring physical protein-protein interactions, and the latter at estimating the functional conservation of orthologs and analyzing the CESSM datasets. In conclusion, HRSS can be applied to different biological problems by quantifying the functional similarity between gene products. HRSS was implemented in the C programming language and is freely available from http://cmb.bnu.edu.cn/hrss.
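The two pairwise strategies compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' C implementation. The term names and similarity values are hypothetical, `sim` stands in for any term-level measure such as HRSS, and the best-match average shown is one common formulation (the mean of the two directional best-match means); the abstract does not state which variant was used.

```python
from typing import Callable, Sequence

TermSim = Callable[[str, str], float]


def max_strategy(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """Gene-product similarity as the highest score over all term pairs."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def best_match_average(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """Mean of the two directional best-match means (one common BMA variant)."""
    best_for_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_for_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    mean_a = sum(best_for_a) / len(best_for_a)
    mean_b = sum(best_for_b) / len(best_for_b)
    return (mean_a + mean_b) / 2


# Hypothetical term-level scores; a real table would come from a measure such as HRSS.
TOY_SCORES = {("t1", "t3"): 0.8, ("t2", "t3"): 0.2}


def toy_sim(a: str, b: str) -> float:
    """Look up a toy score, treating the table as symmetric."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


if __name__ == "__main__":
    print(max_strategy(["t1", "t2"], ["t3"], toy_sim))
    print(best_match_average(["t1", "t2"], ["t3"], toy_sim))
```

In this toy case the maximum strategy reports only the single strongest term pair, whereas the best-match average is pulled down by the poorly matched term `t2`. That difference is the kind of behavior that can make one strategy suit interaction prediction and the other suit ortholog comparison.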

Highlights

With the advent of high-throughput technologies such as DNA and RNA sequencing and microarrays, automatic genome annotation of large sets of genes has been increasingly used.
Scoring physical protein-protein interactions: Gene Ontology (GO)-based semantic similarity has been recognized as one of the strongest indicators for scoring and predicting protein-protein interactions, based on the following two observations: two proteins acting in the same biological process are more likely to interact than proteins involved in different processes [9]; and to interact physically, proteins should exist in close proximity, at least transiently, which suggests that colocalization may serve as a useful predictor for protein interactions [37].
The Relative Specificity Similarity (RSS) method was successfully applied to the prediction of genome-scale protein-protein interactions in yeast by combining the maximum RSS values of all term pairs associated with any two proteins for the biological process (BP) and cellular component (CC) ontologies [4,38].

Summary

Introduction

With the advent of high-throughput technologies such as DNA and RNA sequencing and microarrays, automatic genome annotation of large sets of genes has been increasingly used. The Gene Ontology (GO) [1] system is one such scheme and is becoming the de facto standard for facilitating information searches across databases and for aiding the annotation of molecular features in different model organisms. The functional knowledge encoded in GO should be useful for developing new predictive systems that compare gene products at the functional level and that may be integrated with other models in large-scale genomic research. Explicit comparisons based on the semantic similarity of GO terms provide a quantitative way to measure the functional similarity between gene products and are widely applied in large-scale genomic research via integration with other models. Edge-based semantic similarity metrics are sensitive to the intrinsic structure of GO and consider terms at the same level in the ontology to be equally specific nodes, revealing weaknesses that could be complemented using information content (IC).
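A small Python sketch illustrates why information content can complement edge-based measures. It assumes the standard corpus-based definition IC(t) = -log p(t), where p(t) is the fraction of annotations that fall on term t or its descendants. The term names and counts are hypothetical, and the sketch does not reproduce how HRSS itself combines IC with RSS.

```python
import math
from typing import Dict


def information_content(cumulative_counts: Dict[str, int], root: str) -> Dict[str, float]:
    """Return IC(t) = -log p(t) for every term with at least one annotation.

    cumulative_counts maps each term to the number of annotations on the term
    or any of its descendants, so the root carries the total (an assumption
    about how the counts were prepared).
    """
    total = cumulative_counts[root]
    return {
        term: -math.log(count / total)
        for term, count in cumulative_counts.items()
        if count > 0
    }


# Hypothetical ontology: "A" and "B" are both direct children of "root",
# so an edge-based measure sees them at the same level.
toy_counts = {"root": 100, "A": 90, "B": 10}

if __name__ == "__main__":
    for term, ic in information_content(toy_counts, "root").items():
        print(term, round(ic, 3))
```

Here `A` and `B` sit at the same depth, yet `B` is annotated far less often and therefore receives a much higher IC. A purely level-based view would treat the two terms as equally specific, which is the weakness the text attributes to edge-based metrics.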

Objectives

Methods

Results

Conclusion