Abstract

With the advancement of new high-throughput sequencing technologies, there has been an increase in the number of genome sequencing projects worldwide, which has yielded complete genome sequences of humans, animals and plants. Subsequently, several labs have focused on genome annotation, which consists of assigning functions to gene products, mostly using Gene Ontology (GO) terms. As a consequence, there is increased heterogeneity in annotations across genomes, due both to the different approaches used by different pipelines to infer these annotations and to the nature of the GO structure itself. This makes a curator's task difficult, even if they adhere to the established guidelines for assessing these protein annotations. Here we develop a genome-scale approach for integrating GO annotations from different pipelines using semantic similarity measures. We used this approach to identify inconsistencies and similarities in functional annotations between orthologs of human and Drosophila melanogaster, to assess the quality of GO annotations derived from InterPro2GO mappings compared to manually assigned GO annotations for the Drosophila melanogaster proteome (from a FlyBase dataset) and the human proteome, and to filter GO annotation data for these proteomes. The results indicate that an efficient integration of GO annotations eliminates redundancy of up to 27.08% and 22.32% in the Drosophila melanogaster and human GO annotation datasets, respectively. Furthermore, we identified lacking and missing annotations for some orthologs, and annotation mismatches between the InterPro2GO and manual pipelines in these two proteomes, all of which require further curation. The approach simplifies and facilitates the task of curators in assessing protein annotations, reduces redundancy and eliminates inconsistencies in large annotation datasets, easing comparative functional genomics.
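
To make the notion of redundancy in a GO annotation set concrete, the sketch below uses one common criterion: a term is redundant if it is an ancestor of another term annotated to the same protein, since the more specific term already implies it. This is an illustrative assumption and not necessarily the criterion used in this work, which relies on semantic similarity measures whose details are not given in this section. The toy ontology, the term names (`T:...`) and the function names are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy ontology (not real GO terms): term -> direct parents (is_a).
TOY_PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:process": {"T:root"},
    "T:cell_cycle": {"T:process"},
    "T:mitosis": {"T:cell_cycle"},
    "T:transport": {"T:process"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent edges."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def remove_redundant(annotations: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Keep only terms that are not an ancestor of another annotated term."""
    implied: Set[str] = set()
    for term in annotations:
        implied |= ancestors(term, parents)
    return annotations - implied


def redundancy_percent(annotations: Set[str], parents: Dict[str, Set[str]]) -> float:
    """Percentage of a protein's annotations removed by the ancestor filter."""
    if not annotations:
        return 0.0
    kept = remove_redundant(annotations, parents)
    return 100.0 * (len(annotations) - len(kept)) / len(annotations)


if __name__ == "__main__":
    protein_terms = {"T:process", "T:cell_cycle", "T:mitosis", "T:transport"}
    print(sorted(remove_redundant(protein_terms, TOY_PARENTS)))
    print(redundancy_percent(protein_terms, TOY_PARENTS))
```

In this toy case the two general terms are implied by `T:mitosis` and are dropped, while `T:transport` is kept because no annotated term descends from it. A similarity-based filter, as described in the abstract, could in principle also merge terms that are close in the ontology without being in a strict ancestor relation.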

Highlights

  • The development of fast and relatively inexpensive sequencing technologies has yielded complete genome sequences of thousands of organisms

  • The quality assessment measures include the specificity (level of detail) of the Gene Ontology (GO) annotations used, the consistency of these annotations in terms of similarity between the terms used, the non-redundancy of the annotation set, and annotation match and mismatch scores. These measures are defined using an abstract semantic similarity measure; here we use the GO-universal metric (Mazandu and Mulder, 2012), applied to the Drosophila melanogaster and human proteomes with two sets of annotations: GO annotations manually assigned with the following GO evidence codes (Experimental category): Inferred from Experiment (EXP), Inferred from Direct Assay (IDA), Inferred from Physical Interaction (IPI), Inferred from Mutant Phenotype (IMP), Inferred from Genetic Interaction (IGI) and Inferred from Expression Pattern (IEP); and annotations originating from InterPro2GO, referred to as electronic inference

  • Of the 47592 reviewed human proteome entries from UniProt, 29844, 36177, and 31683 proteins are characterized with respect to the BP, MF, and CC ontologies, respectively; of these, 6507, 8665, and 7416 entries contain manually assigned annotations, and 12422, 21989, and 8725 entries contain annotations inferred electronically using InterPro2GO mappings
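
The human proteome counts in the last highlight are easier to compare as percentages. The short sketch below does only that arithmetic; the numbers are copied from the highlight, while the dictionary layout and function names are hypothetical. It also assumes that the manual and InterPro2GO counts may overlap (a protein can carry both kinds of annotation), so the two percentages are not expected to sum to 100.

```python
from typing import Dict

# Counts copied from the highlight above (human proteome, reviewed UniProt entries).
TOTAL_REVIEWED = 47592
COUNTS: Dict[str, Dict[str, int]] = {
    "BP": {"characterized": 29844, "manual": 6507, "interpro2go": 12422},
    "MF": {"characterized": 36177, "manual": 8665, "interpro2go": 21989},
    "CC": {"characterized": 31683, "manual": 7416, "interpro2go": 8725},
}


def percent(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


def coverage_table(counts: Dict[str, Dict[str, int]], total: int) -> Dict[str, Dict[str, float]]:
    """Per-ontology coverage: share of all entries that are characterized,
    and share of characterized entries with manual or InterPro2GO annotations."""
    rows: Dict[str, Dict[str, float]] = {}
    for ontology, c in counts.items():
        rows[ontology] = {
            "characterized_of_total": percent(c["characterized"], total),
            "manual_of_characterized": percent(c["manual"], c["characterized"]),
            "interpro2go_of_characterized": percent(c["interpro2go"], c["characterized"]),
        }
    return rows


if __name__ == "__main__":
    for ontology, row in coverage_table(COUNTS, TOTAL_REVIEWED).items():
        print(ontology, {name: round(value, 2) for name, value in row.items()})
```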

Summary

Introduction

The development of fast and relatively inexpensive sequencing technologies has yielded complete genome sequences of thousands of organisms. Several sequence databases store these sequences, including GenBank (Benson et al, 2009), Ensembl (Flicek et al, 2010; Fernández-Suárez and Schuster, 2010; Spudich and Fernández-Suárez, 2010), NCBI (Pruitt et al, 2005; Sayers et al, 2009) and the UniProt database, which is an integrated repository of protein sequence and function created by joining the information contained in the SwissProt, TrEMBL, and PIR protein databases (Jain et al, 2009; UniProt-Consortium, 2010). In these databases, a growing deficiency in functional annotation was observed for many sequenced proteins, as approximately 20–50% of proteins within a genome were still labeled “unknown,” “uncharacterized” or “hypothetical” (Mazandu and Mulder, 2012). These include electronic annotation methods, such as Ensembl Compara, InterPro, UniProtKB/Swiss-Prot Keywords (SPKW), UniProtKB-Subcellular Location (SPSL), UniPathway, Enzyme Commission (EC), and High-quality Automated and Manual Annotation of Microbial Proteomes (HAMAP), and manual annotation efforts, such as the Gene Ontology Annotation project, the Reference Genome Annotation Initiative
