Clustering Of Protein Sequences Research Articles

Background: The rapid growth of available protein data makes clustering within protein families increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is designed specifically for non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".

Results: To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity.

Conclusion: We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for alignment-dependent algorithms.
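The abstract above does not define SMS, so the following is only a rough sketch of the general idea of an alignment-free similarity built from shared subsequences. It is not the SMS measure and not the CLUSS algorithm: it swaps in a plain Jaccard index over exact k-mers, with no substitution scoring. The function names (`kmers`, `kmer_jaccard`, `pairwise_similarity`) and the default `k = 3` are assumptions made for illustration.

```python
from __future__ import annotations

from itertools import combinations


def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets (illustrative stand-in, not SMS)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return len(ka & kb) / len(union)


def pairwise_similarity(seqs: dict[str, str], k: int = 3) -> dict[tuple[str, str], float]:
    """Similarity for every unordered pair of named sequences, no alignment needed."""
    return {
        (x, y): kmer_jaccard(seqs[x], seqs[y], k)
        for x, y in combinations(sorted(seqs), 2)
    }
```

A matrix built this way could be passed to any standard clustering routine. Unlike the measure described in the abstract, exact k-mer matching ignores conservative amino acid substitutions, so it would likely be less sensitive on divergent families.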


Selective Na(+)-dependent re-uptake of biogenic monoamines at mammalian nerve synapses is accomplished by three types of solute-linked carrier family 6 (SLC6) membrane transporter with high affinity for serotonin (SERTs), dopamine (DATs) and norepinephrine (NETs). An additional SLC6 monoamine transporter (OAT) is responsible for the selective uptake of the phenolamines octopamine and tyramine by insect neurons. We have characterized a similar high-affinity phenolamine transporter expressed in the CNS of the earthworm Lumbricus terrestris. Phylogenetic analysis of its protein sequence clusters it with both arthropod phenolamine and chordate catecholamine transporters. To clarify the relationships among metazoan monoamine transporters, we identified representatives in the major branches of metazoan evolution by polymerase chain reaction (PCR)-amplifying conserved cDNA fragments from isolated nervous tissue and by analyzing available genomic data. Analysis of conserved motifs in the sequence data suggests that the presumed common ancestor of modern-day Bilateria expressed at least three functionally distinct monoamine transporters in its nervous system: a SERT currently found throughout bilaterian phyla; a DAT now restricted in distribution to protostome invertebrates and echinoderms; and a third monoamine transporter (MAT), widely represented in contemporary Bilateria, that is selective for catecholamines and/or phenolamines. Chordate DATs, NETs, epinephrine transporters (ETs) and arthropod and annelid OATs all belong to the MAT clade. Contemporary invertebrate and chordate DATs belong to different SLC6 clades. Furthermore, the genes for dopamine and norepinephrine transporters of vertebrates are paralogous, apparently having arisen through duplication of an invertebrate MAT gene after the loss of an invertebrate-type DAT gene in a basal protochordate.


Articles published on Clustering Of Protein Sequences

Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing

CLUSS: clustering of protein sequences based on a new similarity measure.

A maximum likelihood approximation method for Dirichlet's parameter estimation

Ancestry of neuronal monoamine transporters in the Metazoa

SEQOPTICS: a protein sequence clustering system

Efficient median based clustering and classification techniques for protein sequences

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Large Scale Protein Sequence Clustering - Not Solved But Solvable

Spectral clustering of protein sequences

Exploiting homogeneity in protein sequence clusters for construction of protein family hierarchies

Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

An Algorithm to Classify Amino Acid Sequences into Protein Groups of Bothrops jararacussu Venomous Gland

Large scale hierarchical clustering of protein sequences

Super paramagnetic clustering of protein sequences

Cluster-C, an algorithm for the large-scale clustering of protein sequences based on the extraction of maximal cliques

Graph-based clustering for finding distant relationships in a large set of protein sequences

Clustering protein sequence and structure space with infinite Gaussian mixture models.

Characterization of a divergent non-classical MHC class I gene in sharks.

Coverage of protein sequence space by current structural genomics targets.

Gclust: Genome-Wide Clustering of Protein Sequences for Identification of Photosynthesis-Related Genes Resulting from Massive Horizontal Gene Transfer
