UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB.

Tunca Doğan, Maria J Martin, Claire O’Donovan, Alistair Macdougall, Alex Bateman, Diego Poggioli, Rabie Saidi

doi:10.1093/bioinformatics/btw114


Open Access


Abstract

Motivation: Similarity-based methods have been widely used to infer the properties of genes and gene products that have little or no experimental annotation. New approaches that overcome the limitations of methods relying solely upon sequence similarity are attracting increased attention. One of these novel approaches is to use the organization of the structural domains in proteins.

Results: We propose a method for the automatic annotation of protein sequences in the UniProt Knowledgebase (UniProtKB) by comparing their domain architectures, classifying proteins based on the similarities and propagating functional annotation. The performance of this method was measured through a cross-validation analysis using the Gene Ontology (GO) annotation of a sub-set of UniProtKB/Swiss-Prot. The results demonstrate the effectiveness of this approach in detecting functional similarity, with an average F-score of 0.85. Applying the method to nearly 55.3 million uncharacterized proteins in UniProtKB/TrEMBL resulted in 44 818 178 GO term predictions for 12 172 114 proteins. Of these predictions, 22% were for 2 812 016 previously non-annotated protein entries, indicating the significance of the value added by this approach.

Availability and implementation: The results of the method are available at: ftp://ftp.ebi.ac.uk/pub/contrib/martin/DAAC/.

Contact: tdogan@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

The reduction in the cost of sequencing has led to the accumulation of a vast amount of data in biological databases
The number of unique domain architectures/arrangements (DA) generated in UniProtKB/Swiss-Prot is ~13% of the number of entries with domain hits. This rate is only 2% for UniProtKB/TrEMBL, and the reason for this can be attributed to the higher redundancy in UniProtKB/TrEMBL compared to UniProtKB/Swiss-Prot
A cross-validation experiment was carried out in order to observe the performance of Domain Architecture Alignment and Classification (DAAC) on data with known labels
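
The ratio quoted in the second highlight (unique domain architectures relative to entries with domain hits) can be illustrated with a minimal sketch. It assumes, for illustration only, that a domain architecture is represented as an ordered tuple of domain identifiers; the function name `unique_da_ratio`, the entry accessions and the domain labels are hypothetical and do not come from the paper.

```python
from typing import Dict, Tuple

# Assumption: a domain architecture (DA) is an ordered tuple of domain labels.
DomainArchitecture = Tuple[str, ...]


def unique_da_ratio(entry_to_da: Dict[str, DomainArchitecture]) -> float:
    """Return (number of distinct DAs) / (number of entries with domain hits)."""
    if not entry_to_da:
        raise ValueError("at least one entry with domain hits is required")
    distinct_das = set(entry_to_da.values())
    return len(distinct_das) / len(entry_to_da)


# Hypothetical toy data: four entries, three distinct architectures.
# ENTRY_1 and ENTRY_2 share a DA; ENTRY_3 has the same domains in a
# different order, which counts as a different DA under the assumption above.
toy_entries = {
    "ENTRY_1": ("DOM_A", "DOM_B"),
    "ENTRY_2": ("DOM_A", "DOM_B"),
    "ENTRY_3": ("DOM_B", "DOM_A"),
    "ENTRY_4": ("DOM_C",),
}

ratio = unique_da_ratio(toy_entries)
print(f"unique DA ratio: {ratio:.2f}")
```

A lower ratio means that more entries share the same architecture, which is consistent with the highlight's explanation that UniProtKB/TrEMBL is more redundant than UniProtKB/Swiss-Prot.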

Summary

Introduction

The reduction in the cost of sequencing has led to the accumulation of a vast amount of data in biological databases. These data are stored in public repositories such as the UniProt Knowledgebase (UniProt Consortium, 2015) for protein sequences, and NCBI GenBank (Benson et al., 2008) and the EMBL Nucleotide Archive (Leinonen et al., 2011) for gene sequences. Defining the functions of genes and gene products is a difficult task due to the biological complexity of organisms. Various projects aim to standardize the description of the functional attributes of biological sequences by introducing controlled vocabularies. The Gene Ontology (GO) project provides the most comprehensive functional standardization system for proteins (Gene Ontology Consortium, 2015). GO uses a directed acyclic graph (DAG) structure to define functions from generic to specific in three main categories, namely molecular function, biological process and cellular component.
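
To make the generic-to-specific DAG structure concrete, the following minimal sketch collects all more generic terms reachable from a given term. The term names and parent links are hypothetical placeholders, not real GO identifiers or relationships, and the function `ancestors` is an illustrative helper rather than part of any GO tool.

```python
from typing import Dict, Set

# Hypothetical toy DAG: each term maps to its direct (more generic) parents.
# A term may have several parents, which is what makes this a DAG, not a tree.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "generic_a": {"root"},
    "generic_b": {"root"},
    "intermediate": {"generic_a"},
    "specific": {"intermediate", "generic_b"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("specific", PARENTS)))
```

In this toy graph, annotating a protein with `specific` implies every term returned by `ancestors`, which is the usual reason a prediction at a specific term also counts towards its more generic ancestors.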

Methods

Results

Conclusion