Abstract
The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage. We use the network for predicting functions of uncharacterised proteins.AvailabilityProtein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.
Highlights
In recent years we have experienced an exponential growth of biological data, including primary data such as genomic sequences resulting from worldwide DNA sequencing efforts and as well as functional data from high-throughput experiments, respectively
Discovering sequence homology and modelling functional interactions between homologues from sequence and experimental data constitutes an important problem in molecular biology, as these can help to describe their behaviour in cellular processes and reveal the interplay between particular genes and proteins
Several Bioinformatics tools have been designed for deriving and storing these functional features. These include standard sequence comparison tools such as Basic Local Alignment Search Tool (BLAST) [3,4], protein sequence databases such as UniProt [5], and protein signature databases such as InterPro [6], which integrates together predictive models or protein signatures representing protein domains, families and functional sites, from multiple source databases, namely, PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY, Gene3D, PANTHER [7]
Summary
In recent years we have experienced an exponential growth of biological data, including primary data such as genomic sequences resulting from worldwide DNA sequencing efforts and as well as functional data from high-throughput experiments, respectively. This abundance of primary sequence data and the large availability of public gene and protein sequence databases have the capability to provide many new insights into the biology of organisms. Sequences sharing similar or conserved features are referred to as homologous sequences, and these features can be used for inferring and scoring protein pair-wise functional connections. These include standard sequence comparison tools such as BLAST [3,4], protein sequence databases such as UniProt [5], and protein signature databases such as InterPro [6], which integrates together predictive models or protein signatures representing protein domains, families and functional sites, from multiple source databases, namely, PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY, Gene3D, PANTHER [7]
Published Version (
Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have