Protein Sequence Families Research Articles

The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the variety of motifs described by strings that include 'don't care' (dot) patterns escalates exponentially with the length of the motif, and this gets only worse if a dot is allowed to stretch up to some prescribed maximum length. This circumstance tends to generate daunting computational burdens, and often gives rise to tables that are impossible to visualize and digest. This is unfortunate, as it seems to preclude precisely those massive analyses that have become conceivable with the increasing availability of massive genomic and protein data. Although a part of the problem is endemic, another part of it seems rooted in the various characterizations offered for the notion of a motif, that are typically based either on syntax or on statistics alone. It seems worthwhile to consider alternatives that result from a prudent combination of these two aspects in the model. We introduce and study a notion of extensible motif in a sequence which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. We show that a combination of appropriate saturation conditions (expressed in terms of minimum number of dots compatible with a given list of occurrences) and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate over-represented motifs. The merits of the method are documented by the results obtained in implementation, which specifically targeted protein sequence families. In all cases tested, the motif reported in PROSITE as the most important in terms of functional/structural relevance emerges among the top 30 extensible motifs returned by our algorithm, often right at the top. 
Equally important, the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints. This software will be available for use with the suite of tools at www.research.ibm.com/bioinformatics.
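
The idea of a motif whose dots may stretch up to a prescribed maximum length can be illustrated with a small sketch. This is not the authors' algorithm, and it performs no saturation check or statistical scoring. It only shows how an extensible motif might be matched against a sequence and its occurrences counted. The token encoding (a residue letter, "." for a single don't care, or a (min, max) tuple for an extensible gap) and the function names are assumptions made for this illustration.

```python
import re


def motif_to_regex(tokens):
    """Compile a hypothetical extensible-motif encoding into a regex.

    Each token is one of (assumed encoding, not from the paper):
      - a residue letter such as "A"
      - "." for exactly one don't-care position
      - a (lo, hi) tuple for a gap of lo to hi don't-care positions
    """
    parts = []
    for tok in tokens:
        if isinstance(tok, tuple):
            lo, hi = tok
            parts.append(".{%d,%d}" % (lo, hi))
        elif tok == ".":
            parts.append(".")
        else:
            parts.append(re.escape(tok))
    # A lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=(" + "".join(parts) + "))")


def occurrence_starts(tokens, sequence):
    """Return the start positions at which the motif occurs in sequence."""
    pattern = motif_to_regex(tokens)
    return [m.start() for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # "A", then a gap of 1 to 3 arbitrary residues, then "C".
    motif = ["A", (1, 3), "C"]
    print(occurrence_starts(motif, "AXCAXXXCAC"))
```

Each start position is counted once, even when several gap lengths would fit there. A real implementation would also need to decide how to treat such alternative realizations before computing any over-representation score.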


The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting on the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes. 
From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.


Articles published on Protein Sequence Families

A Lazy Data Mining Approach for Protein Classification

A Structural Viral Mimic of Prosurvival Bcl-2: A Pivotal Role for Sequestering Proapoptotic Bax and Bak

The structure of Plasmodium vivax phosphatidylethanolamine-binding protein suggests a functional motif containing a left-handed helix

ProFaM: An Efficient Algorithm for Protein Sequence Family Mining

Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost

Evolution of protein structural classes and protein sequence families

REMUS: a tool for identification of unique peptide segments as epitopes

Identification of latent periodicity in amino acid sequences of protein families

Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs)

SIMPROT: Using an empirically determined indel distribution in simulations of protein evolution

Protein Family Clustering for Structural Genomics

MAVL/StickWRLD for protein: visualizing protein sequence families to detect non-consensus features

Location proteomics: a systems approach to subcellular location

Conservative extraction of over-represented extensible motifs

Progress of Structural Genomics Initiatives: An Analysis of Solved Target Structures

Determining functional specificity from protein sequences

Recapitulation of Protein Family Divergence using Flexible Backbone Protein Design

Into the heart of darkness: large-scale clustering of human non-coding DNA.

Particle Swarm Optimisation for Protein Motif Discovery

FamClash: a method for ranking the activity of engineered enzymes.
