SUS-BAR: a database of pig proteins with statistically validated structural and functional annotation

D Piovesan, P L Martelli, L Fontanesi, R Casadio, P Fariselli, G Profiti

doi:10.1093/database/bat065

Abstract

Given the relevance of the pig proteome to many studies, including those of complex human diseases, statistical validation of its annotation is needed to better understand the role of specific genes and proteins in the complex networks underlying biological processes in the animal. At present, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment against pre-existing sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. The statistical validation relies on a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontology (GO) terms from the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains and, when possible, with a cluster Hidden Markov Model that allows modelling the 3D structure of the protein. A database search provides statistics demonstrating the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching in SUS-BAR allows retrieval of the pig protein annotation for further analysis. The search can also be based on specific GO terms, which allows retrieval of all the pig sequences that, after annotation with our system, participate in a given biological process. Alternatively, the search can be based on structural information, allowing retrieval of all the pig sequences with the same structural characteristics.

Database URL: http://bar.biocomp.unibo.it/pig/

Highlights

In recent years, significant progress has been made in pig genomics due to the integration of modern sequencing techniques and computational biology methods [1,2,3,4].
This observation is the basis of building by comparison, one of the most successful methods for computing the 3D structure of a protein sequence when a template is found with a sequence similarity search against the Protein Data Bank (PDB) [11].
With the exception of the proteins listed in SwissProt, most of the annotation derives from feature transfer based mainly on profile and Hidden Markov Model (HMM) methodologies (UniProtKB/TrEMBL; http://www.ebi.ac.uk/GOA).

Summary

Introduction

Significant progress has been made in pig genomics due to the integration of modern sequencing techniques and computational biology methods [1,2,3,4]. A similarity search, in which proteins with no annotation are compared with proteins of known annotation, can be routinely performed to attribute structural and functional annotation to the unknown protein [5,6,7,8]. The notion behind this procedure is that protein structure is more conserved than sequence through evolution, a very general concept that helps in modelling proteins with similar sequences as long as their sequence identity (SI) is at least 30% over the alignment length [9,10]. Most of the sequence entries are proteins that have only been recognized on the basis of sequence similarity or predicted without any experimental evidence of their existence (http://www.ebi.ac.uk/uniprot/TrEMBLstats/). This is the case for the majority of the currently available pig proteins, whose annotation relies mainly on automatic procedures.
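
As an illustration of the sequence identity criterion mentioned above, the following minimal Python sketch computes SI over the alignment length for two pre-aligned sequences and checks it against a 30% cutoff. It is not the SUS-BAR procedure. The function names, the gap character and the reading of the cutoff as a lower bound are assumptions made for this example, and the alignment itself is taken as given.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return the fraction of alignment columns with identical residues.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length. Gap columns count towards the alignment length but never
    count as matches (an assumption; other SI definitions exist).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for a, b in zip(aligned_a, aligned_b)
        if a == b and a != gap
    )
    return matches / len(aligned_a)


def passes_identity_cutoff(
    aligned_a: str, aligned_b: str, min_identity: float = 0.30
) -> bool:
    """Check the hypothetical rule SI >= min_identity (default 30%)."""
    return sequence_identity(aligned_a, aligned_b) >= min_identity


if __name__ == "__main__":
    # Toy alignment: 4 identical columns out of 6.
    query = "ACDE-G"
    template = "ACDFHG"
    print(f"SI = {sequence_identity(query, template):.2f}")
    print("passes cutoff:", passes_identity_cutoff(query, template))
```

Counting gap columns in the denominator makes the measure stricter for alignments with long insertions. A pipeline that normalises by the shorter sequence or by aligned residues only would give different values for the same pair.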

Methods

Results

Conclusion