Abstract
Background: Genomics and proteomics experiments produce large amounts of data that await functional elucidation. An important step in analyzing such data is to identify functional units, which consist of proteins that play coherent roles in carrying out a function. Importantly, functional coherence is not identical to functional similarity. For example, proteins in the same pathway may not share the same Gene Ontology (GO) terms, but they work in a coordinated fashion so that the intended function can be performed. Thus, simply applying existing functional similarity measures might not be the best way to identify functional units in omics data.
Results: We have designed two scores for quantifying functional coherence by considering associations of GO terms observed in two biological contexts: co-occurrences in protein annotations and co-mentions in the literature in the PubMed database. The counted co-occurrences of GO terms were normalized in a fashion similar to the way the statistical amino acid contact potential is computed in the protein structure prediction field. We demonstrate that the developed scores can identify functionally coherent protein sets, i.e. proteins in the same pathways, co-localized proteins, and protein complexes, with statistically significant score values and with better accuracy than existing functional similarity scores. The scores are also capable of detecting protein pairs that interact with each other. It is further shown that the functional coherence scores can accurately assign proteins to their respective pathways.
Conclusion: We have developed two scores that quantify the functional coherence of sets of proteins. The scores reflect the actual associations of GO terms observed either in protein annotations or in the literature. They accurately distinguish biologically relevant groups of proteins from random ones and show good discriminative power for detecting interacting pairs of proteins. The scores were further successfully applied to assigning proteins to pathways.
Highlights
Genomics and proteomics experiments produce a large amount of data that are awaiting functional elucidation
The Co-occurrence Association Score (CAS) quantifies the frequency of Gene Ontology (GO) terms that co-occur in the gene annotations, while the PubMed Association Score (PAS) takes into account co-occurrence of GO terms in the PubMed abstracts
The Gene Ontology database used in this study contains 17,316 Biological Process (BP), 2,534 Cellular Component (CC), and 9,428 Molecular Function (MF) domain terms, which result in a total of 29,278 terms
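The idea behind a co-occurrence score such as CAS can be illustrated with a small sketch. This is not the authors' implementation: the exact normalization is not given in this section, so the code assumes a simple observed-over-expected ratio, in the spirit of statistical contact potentials. The function name `cooccurrence_scores`, the toy protein names, and the placeholder term IDs (`GO:A`, `GO:B`, `GO:C`) are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(annotations):
    """Score GO term pairs by how often they co-occur in protein annotations.

    annotations: dict mapping a protein ID to the set of GO terms annotating it.
    Returns a dict mapping a sorted (term_a, term_b) tuple to the ratio of the
    observed pair frequency to the frequency expected if the two terms were
    assigned independently. ASSUMPTION: this ratio form is illustrative only.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        term_counts.update(terms)
        # Count each unordered pair of terms once per protein.
        for a, b in combinations(sorted(terms), 2):
            pair_counts[(a, b)] += 1

    total_terms = sum(term_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        observed = n_ab / total_pairs
        expected = (term_counts[a] / total_terms) * (term_counts[b] / total_terms)
        scores[(a, b)] = observed / expected
    return scores


# Toy input with placeholder term IDs (not real GO identifiers).
toy = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A", "GO:B"},
    "P3": {"GO:A", "GO:C"},
    "P4": {"GO:C"},
}
scores = cooccurrence_scores(toy)
```

In this toy input, `GO:A` and `GO:B` co-occur in two proteins while `GO:A` and `GO:C` co-occur in only one, so the first pair receives the higher score. A literature-based score such as PAS could reuse the same counting scheme with abstracts in place of proteins, again as an assumption rather than a description of the published method.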
Summary
Genomics and proteomics experiments produce a large amount of data that are awaiting functional elucidation. In response to the weaknesses of conventional homology search methods, e.g. limited coverage in genome annotations and the need for homologous proteins [17,18,19,20], various new approaches for function prediction have been developed in the past decade. These include methods that use sequence information in an elaborate fashion [21,22,23,24,25,26,27], methods that compare global and local tertiary structure information [8], and methods that use large-scale experimental data on proteins [11,28,29,30,31,32,33,34,35]. Clustering genes by functional similarity is an indispensable step in finding the biological principles underlying the observed data.