Self-prediction of relations in GO facilitates its quality auditing

Cheng Chen, Lingyun Luo, Chunlei Zheng, Pingjian Ding, Huan Liu, Hanyu Luo

doi:10.1016/j.jbi.2023.104441

Abstract

As applications of the Gene Ontology (GO) increase rapidly in the biomedical field, quality auditing of GO is becoming increasingly important. Existing auditing methods are mostly based on rules, observed patterns, or hypotheses. In this study, we propose a machine-learning-based framework that lets GO audit itself: we first predict the IS-A relations among concepts in GO, then use the differences between the predicted results and the existing relations to uncover potential errors. Specifically, we transform the taxonomy of the GO January 2020 release into a dataset with concept pairs as items and the relations between them as labels (pairs with no direct IS-A relation are labeled as ndrs). To fully represent each pair, we integrate the embeddings of the concept name, the concept definition, and the concept node in a substring-based topological graph. We divide the dataset into 10 parts and rotate over them, each time choosing one part as the testing set and the remaining parts as the training set. After 10 rotations, the prediction model had predicted 4,640 existing IS-A pairs as ndrs. In the GO March 2022 release, 340 of these predictions were validated, a statistically significant result (p-value of 1.60e−46) when compared with the results for randomly selected pairs. Conversely, the model predicted 2,840 of 17,079 selected ndrs in GO to be IS-A relations. After deleting those that caused redundancies or cycles, 924 predicted IS-A relations remained. Among 200 randomly selected pairs, 30 were validated as missing IS-A relations by domain experts. In conclusion, this study investigates a novel way of auditing biomedical ontologies by predicting the relations in them, which was shown to be useful for discovering potential errors.
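
The step of deleting predicted IS-A relations that introduce redundancy or cyclic paths can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a predicted pair is redundant when the parent is already an ancestor of the child in the existing taxonomy, and that it creates a cycle when the child is already an ancestor of the parent. Each candidate is checked independently against the existing edges only. The function names and the toy concept identifiers are hypothetical.

```python
from collections import defaultdict


def build_parents(is_a_edges):
    """Map each concept to the set of its direct IS-A parents."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return parents


def ancestors(node, parents):
    """Return every concept reachable from `node` by following IS-A edges upward."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_predicted_is_a(existing_edges, predicted_edges):
    """Keep predicted (child, parent) pairs that are neither redundant nor cyclic.

    Assumption: candidates are judged one at a time against the existing
    taxonomy, not against each other.
    """
    parents = build_parents(existing_edges)
    kept = []
    for child, parent in predicted_edges:
        if child == parent:
            continue  # a self-loop is a trivial cycle
        if parent in ancestors(child, parents):
            continue  # redundant: already implied by transitivity
        if child in ancestors(parent, parents):
            continue  # cycle: the parent already sits below the child
        kept.append((child, parent))
    return kept


# Toy taxonomy (hypothetical identifiers): B IS-A A, C IS-A B, D IS-A A.
existing = [("B", "A"), ("C", "B"), ("D", "A")]
predicted = [("C", "A"), ("A", "C"), ("D", "B"), ("B", "B")]
print(filter_predicted_is_a(existing, predicted))  # [('D', 'B')]
```

In this toy case, C IS-A A is dropped as redundant, A IS-A C and B IS-A B are dropped because they would create cycles, and only D IS-A B survives for expert review.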

Full Text