Abstract

Background: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated.

Results: In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide a detailed error analysis demonstrating that the method achieves high precision on its more confident predictions.

Conclusions: Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.

Highlights

  • Literature-based gene ontology (GO) annotations (GOA) are produced by reviewing the description of experiments in research papers, selecting appropriate GO terms for the experimental findings, and labelling the annotation with a GO evidence code [5,6,7,8] indicating the nature of the evidence

  • We introduce different methods for measuring semantic similarity, either between natural-language texts within documents or between GO terms modelled on a Directed Acyclic Graph (DAG); some of these are used in our methods

  • Training set optimisation and the addition of evidence section information further improve precision (+0.2 and +0.15) in recognising consistent GOA
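
To make the idea of semantic similarity between terms on a DAG concrete, here is a minimal sketch. It is not the measure used in the paper, which this page does not specify. It assumes a simple ancestor-overlap (Jaccard) score, and the term names, the `PARENTS` mapping, and both function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical toy DAG: each term maps to its direct parents.
# These identifiers are illustrative and are not real GO accessions.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "protein_binding": {"binding"},
    "dna_binding": {"binding"},
    "kinase_activity": {"catalysis"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus every term reachable via parent edges."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def jaccard_similarity(a: str, b: str, parents: Dict[str, Set[str]]) -> float:
    """Assumed measure: shared ancestors divided by all ancestors of either term."""
    anc_a = ancestors(a, parents)
    anc_b = ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


if __name__ == "__main__":
    # Sibling terms share more ancestors than terms from different branches.
    print(jaccard_similarity("protein_binding", "dna_binding", PARENTS))
    print(jaccard_similarity("protein_binding", "kinase_activity", PARENTS))
```

Under this assumed measure, terms that share a close parent score higher than terms that meet only at the root, which is the intuition behind graph-based GO term similarity.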


Summary

Introduction

Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. GO annotation of genes involves two major components: the GO information, which includes GO terms and their definitions or descriptions, and supportive evidence, which includes the coding regions on a genomic sequence, or a reference to a document describing experimental findings relating to gene product function. Literature-based GO annotations (GOA) are produced by reviewing the description of experiments in research papers, selecting appropriate GO terms for the experimental findings, and labelling the annotation with a GO evidence code [5,6,7,8] indicating the nature of the evidence. There is a pressing need for reliable tools for automatic curation of GOA, as the volume of biological data is constantly increasing.
