GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness

Meng Liu, Paul D Thomas

doi:10.1186/s12859-019-2752-2

Abstract

Background: Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. Recent studies have reported that biases in available GO annotations result in biased estimates of functional similarities of genes, but it is still unclear what the effect of incompleteness itself may be, even in the absence of bias. Pairwise gene similarities are used in a number of contexts, including gene “functional similarity” clustering and the related problem of functional ontology structure inference, but it is not known how different similarity measures or clustering methods perform on this task, or how the clusters are affected by annotation completeness.

Results: We developed representations of both “complete” and “incomplete” GO annotation datasets based on experimentally supported annotations from the GO database (specifically designed to model the incompleteness of human gene annotations) and computed semantic similarities for each set using a variety of published measures. We then assessed the clusters derived from these measures using two clustering methods: hierarchical clustering and the CliXO algorithm. We find that the CliXO algorithm, combined with appropriate measures, performs better than hierarchical clustering in reconstructing GO, both when the data are complete and when they are incomplete. Some measures, particularly those that create a pairwise gene similarity by averaging over all pairwise annotation similarities, had consistently poor performance. A few measures, such as Lin best-matched average and Relevance maximum, were generally among the best performers across a broad range of annotation completeness and types of GO classes. Finally, we show that for semantic similarity-based clustering, the multicellular organism process branch of the GO biological process ontology is more challenging to represent than the cellular process branch.

Conclusions: We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and the subsequent effects on the clusters derived from these scores. Our results suggest combinations of semantic similarity measures, gene-level scoring methods and clustering methods that perform best for functional gene clustering using annotation sets of varying completeness. Overall, our results underscore the importance of increasing the completeness of GO annotations for supporting computational analyses of gene function.
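The abstract distinguishes gene-level scoring methods that average over all pairwise annotation similarities from those that take the maximum or a best-matched average. The sketch below illustrates that difference on a toy matrix. It is not the authors' implementation: the function name `gene_similarity`, the toy scores, and the specific best-matched average variant (the mean of the row-wise and column-wise best-match averages) are assumptions made for illustration, and the term-level similarities are taken as given rather than computed from GO.

```python
from statistics import mean


def gene_similarity(term_sim, method="bma"):
    """Combine a term-by-term similarity matrix into one gene-pair score.

    term_sim[i][j] is the similarity between annotation i of gene A and
    annotation j of gene B, from any term-level measure (e.g. Lin).
    The name and the matrix layout are hypothetical, chosen for this sketch.
    """
    rows = [list(r) for r in term_sim]
    cols = [list(c) for c in zip(*rows)]
    if method == "avg":
        # Average over all pairwise annotation similarities.
        return mean(v for r in rows for v in r)
    if method == "max":
        # Score of the single best-matching annotation pair.
        return max(v for r in rows for v in r)
    if method == "bma":
        # Best-matched average (one common variant, assumed here):
        # average each annotation's best match, in both directions.
        row_best = mean(max(r) for r in rows)
        col_best = mean(max(c) for c in cols)
        return (row_best + col_best) / 2
    raise ValueError(f"unknown method: {method}")


# Toy example: gene A has 2 annotations, gene B has 3 (made-up scores).
toy = [
    [1.0, 0.2, 0.1],
    [0.3, 0.8, 0.0],
]
scores = {m: gene_similarity(toy, m) for m in ("avg", "max", "bma")}
print(scores)
```

In this toy case the two genes share two well-matched annotations, yet the all-pairs average is pulled down by the many unrelated term pairs, while the maximum and best-matched average stay high. This is one intuition for why all-pairs averaging can behave poorly for genes with several distinct functions.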

Highlights

Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete
We focus on Gene Ontology (GO) biological process annotations; we recognize that GO biological processes span multiple levels of biological organization, so we consider separately GO cellular processes and GO multicellular organism-level processes
We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and subsequent effects on the clusters derived from these scores

Summary

Introduction

Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. The Gene Ontology (GO), a standardized vocabulary of biological function and process terms, is one of the most frequently used resources for gene function annotations [1]. It consists of three domains: molecular function (how a gene functions at the molecular level, e.g. a protein kinase), cellular component (the location, relative to cell compartments and structures, where the gene product is active, e.g. the plasma membrane) and biological process (the larger processes that a gene product helps to carry out). The GO is commonly used in many applications, including gene set enrichment [2, 3, 4, 5], gene network [6, 7] and pathway analysis [8, 9].

Journal: BMC Bioinformatics	Publication Date: Mar 27, 2019
Citations: 20	License type: open-access
