Abstract
Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
Highlights
The foundation of the Gene Ontology (GO) Consortium was a critical step toward the adoption of formal and objective knowledge representations in biological sciences and addressed the need for knowledge sharing and functional comparisons in the face of the rapid growth of genomic sequence data [1].
We made three additions to the association rule learning (ARL) methodology to improve the performance of our GO relationship learning (GRL) algorithm, taking into account the nature of the data and the type of relationship we are interested in capturing.
Based on the criterion that a protein is incompletely annotated if it has any non-redundant annotation to a term with more than 10 descendants, we estimate that 64% of the proteins are incompletely annotated and 68% of the MFclasses correspond to incomplete protein functions.
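The incompleteness criterion in the last highlight can be illustrated with a minimal Python sketch. It assumes a toy ontology given as a parent-to-children mapping and reads "non-redundant" as "no more specific term is also annotated to the same protein". The term names (T0 to T5), the function names, and the lowered threshold used in the example are hypothetical and are not taken from the paper.

```python
def descendants(children, term):
    """Return every term reachable from `term` through child links (term excluded)."""
    seen = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def non_redundant(children, annotations):
    """Drop annotations that are implied by a more specific annotated term."""
    terms = set(annotations)
    return {t for t in terms if not (descendants(children, t) & terms)}


def is_incomplete(children, annotations, max_descendants=10):
    """Apply the criterion: any non-redundant term with too many descendants."""
    return any(
        len(descendants(children, t)) > max_descendants
        for t in non_redundant(children, annotations)
    )


# Hypothetical toy ontology: T0 is the root, T5 is the most specific term.
toy_children = {
    "T0": ["T1", "T2"],
    "T1": ["T3", "T4"],
    "T3": ["T5"],
}

# The toy ontology is tiny, so the threshold is lowered from 10 to 2.
protein_a = {"T1", "T5"}  # T1 is redundant because T5 is one of its descendants
protein_b = {"T1"}        # T1 is non-redundant and has 3 descendants

print(is_incomplete(toy_children, protein_a, max_descendants=2))
print(is_incomplete(toy_children, protein_b, max_descendants=2))
```

With a real ontology, the mapping would be built from the GO graph, and the default threshold of 10 would match the criterion stated above.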
Summary
The foundation of the Gene Ontology (GO) Consortium was a critical step toward the adoption of formal and objective knowledge representations in biological sciences and addressed the need for knowledge sharing and functional comparisons in the face of the rapid growth of genomic sequence data [1]. GO is currently the de facto standard for functional annotation of gene products in the categories molecular function, biological process, and cellular component. The ontology is under constant development because both our knowledge of biological phenomena and our ability to represent that knowledge are continuously growing [2]. While ontology development is carried out by human curators, it can be assisted by computational approaches that facilitate handling its increasing size and complexity. In this context, the use of the association rule learning methodology has been proposed to identify relationships between GO terms with the goal of enriching the ontology [3,4]. The ongoing extension of GO with computable logical definitions will enable the partial automation of the development of the ontology and facilitate the identification of errors and missing relationships [6].
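To make the association rule learning idea concrete, the following is a minimal Python sketch of the basic methodology only: it counts how often two terms are annotated to the same protein and keeps rules "A implies B" whose support and confidence pass a threshold. It does not include the additions that distinguish the GRL algorithm. The protein and term names, the function name `mine_rules`, and the threshold values are hypothetical.

```python
from itertools import permutations


def mine_rules(protein_terms, min_support=2, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for rules A -> B."""
    term_count = {}
    pair_count = {}
    for terms in protein_terms.values():
        terms = set(terms)
        for t in terms:
            term_count[t] = term_count.get(t, 0) + 1
        # Count every ordered pair of distinct terms co-annotated to this protein.
        for a, b in permutations(sorted(terms), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), n_ab in pair_count.items():
        # Confidence: fraction of proteins with term a that also have term b.
        confidence = n_ab / term_count[a]
        if n_ab >= min_support and confidence >= min_confidence:
            rules.append((a, b, n_ab, confidence))
    return sorted(rules)


# Hypothetical toy annotation set: protein identifier -> annotated terms.
toy_annotations = {
    "P1": {"X", "Y"},
    "P2": {"X", "Y"},
    "P3": {"Y"},
    "P4": {"X", "Y", "Z"},
}

for antecedent, consequent, support, confidence in mine_rules(toy_annotations):
    print(f"{antecedent} -> {consequent}: support={support}, confidence={confidence:.2f}")
```

In this toy data every protein annotated with X is also annotated with Y, so the rule X -> Y is kept, while Y -> X is rejected because only three of the four proteins with Y also have X. A rule of this kind is only a candidate relationship for a curator to review, not a confirmed one.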