Bioinformatics approaches to the study of proteins have led to the introduction of different methodologies and related tools for the analysis of different types of protein data, ranging from primary, secondary and tertiary structures to interaction data [1], as well as functional knowledge. One of the most advanced tools for encoding and representing functional knowledge in a formal way is the Gene Ontology (GO) [2,3]. It is composed of three ontologies, named Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). Each ontology consists of a set of terms (GO terms) representing different functions, biological processes and cellular components within the cell. GO terms are connected to each other to form a hierarchical graph, in which terms representing similar functions lie close to each other. Biological molecules are associated with the GO terms that represent their functions, biological roles and localization. This process, usually referred to as the annotation process, can be performed under the supervision of an expert or in a fully automated way. Computationally inferred annotations, identified by the evidence code Inferred from Electronic Annotation (IEA), are generally regarded as less reliable than experimentally determined annotations. For this reason every annotation is labeled with an Evidence Code (EC) that keeps track of the type of process used to produce the annotation itself. In the April 2010 release of annotations, about 98% of all annotations were IEA annotations [4]. The term annotation corpus is commonly used to identify all the annotations involving a set of proteins or genes, usually referring to whole proteomes and genomes (e.g. the annotation corpus of yeast). For lack of space we do not describe the Gene Ontology further. Comprehensive reviews have been provided by du Plessis et al. [4] and by Guzzi et al. [5].
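The structure described above, terms linked into a hierarchical graph and gene products attached to terms with an evidence code, can be illustrated with a minimal Python sketch. The term labels, protein identifiers and annotations below are hypothetical toy data rather than real GO content; only the evidence codes IEA and IDA (Inferred from Direct Assay) are actual GO codes.

```python
from collections import Counter

# Hypothetical toy ontology: each term maps to its direct parents.
# A term may have more than one parent, so the structure is a graph, not a tree.
PARENTS = {
    "T_root": [],
    "T_a": ["T_root"],
    "T_b": ["T_root"],
    "T_a1": ["T_a"],
    "T_ab": ["T_a", "T_b"],
}

# Hypothetical annotation corpus: (gene product, term, evidence code).
ANNOTATIONS = [
    ("P1", "T_ab", "IDA"),
    ("P2", "T_a1", "IEA"),
    ("P3", "T_b", "IEA"),
    ("P4", "T_a", "IEA"),
]


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def evidence_fractions(annotations):
    """Return the fraction of annotations carrying each evidence code."""
    counts = Counter(code for _, _, code in annotations)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


if __name__ == "__main__":
    print(sorted(ancestors("T_ab")))
    print(evidence_fractions(ANNOTATIONS))
```

Because every annotation keeps its evidence code, the share of electronically inferred annotations in a corpus can be read off directly, which is how a figure such as the 98% quoted above would be obtained from real data.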
The availability of well-formalized functional data has enabled the use of computational methods to analyse genes and proteins from a functional point of view. For example, a set of algorithms, known as functional enrichment algorithms, has been developed to determine the statistical significance of the presence (or the absence) of a GO term in a set of gene products. A detailed review of these algorithms can be found in [4]. An interesting problem is how to express the relationships between GO terms quantitatively. Several measures, referred to as (term) semantic similarity (SS) measures, have been introduced in the last decade. Given two or more GO terms, they try to quantify the similarity of the functional aspects represented by the terms within the cell. By exploiting annotation corpora, semantic similarity measures have been further extended to evaluate the similarity of genes and proteins on the basis of their annotations. Many different works have focused on the following tasks: (i) the definition of ad-hoc semantic similarity measures tailored to the characteristics of the Gene Ontology; (ii) the definition of measures for comparing genes and proteins; (iii) the introduction of methodologies for the systematic assessment of semantic similarity measures; (iv) the use of semantic similarity measures in many different contexts and applications. Despite its relevance, the application of semantic similarity to the systematic analysis of protein data is still an open research area. There are, in fact, two main questions that have to be addressed: (i) the systematic assessment of SS with respect to other biological features, i.e. to what extent a high or a low SS value is biologically meaningful; (ii) the reliability of the SS measures themselves, i.e. whether there is any systematic error or bias in the calculation of SS.
Both these problems are relevant for the diffusion of SS measures; while for the first several approaches have been proposed, comparing SS measures against a plethora of different biological features, only a few works have dealt with the second problem in a systematic way [5,6,7].
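To make the notion of semantic similarity concrete, the following Python sketch computes one well-known information-content-based term measure (Resnik's, in which the similarity of two terms is the information content of their most informative common ancestor) and extends it to gene products by taking the maximum over all pairs of annotated terms. The text above does not prescribe a particular measure, so both choices are illustrative assumptions, and the ontology, protein identifiers and annotations are hypothetical toy data.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "b1": ["b"],
}

# Hypothetical annotation corpus: gene product -> directly annotated terms.
ANNOTATIONS = {
    "P1": {"a1"},
    "P2": {"a2"},
    "P3": {"b1"},
    "P4": {"a1", "b1"},
}


def ancestors_or_self(term):
    """Return `term` together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t), where n_t counts the gene products annotated
    to t or to any of its descendants, and N is the corpus size."""
    total = len(ANNOTATIONS)
    ic = {}
    for term in PARENTS:
        covered = sum(
            1
            for terms in ANNOTATIONS.values()
            if any(term in ancestors_or_self(t) for t in terms)
        )
        if covered:  # terms never used in the corpus get no IC value
            ic[term] = math.log(total / covered)
    return ic


IC = information_content()


def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors_or_self(term_a) & ancestors_or_self(term_b)
    return max((IC[t] for t in common if t in IC), default=0.0)


def gene_similarity(gene_a, gene_b):
    """Maximum term similarity over all pairs of annotated terms."""
    return max(
        resnik(s, t) for s in ANNOTATIONS[gene_a] for t in ANNOTATIONS[gene_b]
    )


if __name__ == "__main__":
    print(resnik("a1", "a2"), gene_similarity("P1", "P4"))
```

In this sketch the information content is estimated from the annotation corpus itself, so adding or removing annotations changes the scores even when the ontology is unchanged. This is one simple way to see why the reliability of SS values cannot be taken for granted.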