Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'

Qi Dai, Tianming Wang

doi:10.1186/1471-2105-9-394

Abstract

Background: Many statistical measures have been proposed that efficiently compare protein sequences in order to infer protein structure, function and evolutionary information. They share the idea of using k-word frequencies of protein sequences. Until now, however, information on the sequences related to a given protein has not been used for protein sequence comparison. This paper proposes a scheme for constructing a protein 'sequence space', built from the protein sequences related to the given protein, and compares the performance of statistical measures with and without the information in this 'sequence space'. It also presents two statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure).

Results: We tested the statistical measures, with and without the protein 'sequence space', on three data sets. This offers a systematic, quantitative experimental assessment of these measures and complements the available comparisons of statistical measures based on the protein sequence alone. We also compared our statistical measures with alignment-based measures and with the existing statistical measures. The experiments fall into two sets. The first, performed via ROC (receiver operating characteristic) analysis, assesses the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second assesses how well our measure performs in phylogenetic analysis. From these experiments we draw several conclusions and derive guidelines for the use of the protein 'sequence space' and statistical measures.

Conclusion: Alignment-based measures have a clear advantage when the data are highly redundant. Among the statistical measures, the most efficient is the novel gsm.k introduced in this article, followed by cos.k. When the data become less redundant, our proposed gre.k achieves better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures improve when they exploit the information in the 'sequence space' as the word length increases, especially for less redundant data. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploiting the information in the 'sequence space' is a promising way to improve the ability of statistical measures to compare proteins.

Highlights

Many proposed statistical measures can efficiently compare protein sequences to further infer protein structure, function and evolutionary information
The Rost and Sander data set (RS126) (Additional file 2) was designed for the secondary structure prediction of proteins with a pair-wise sequence similarity of less than 25% [32], and it was used as test data to evaluate the performance of similarity measures [33]
We compare the proteins' secondary structures, but analyse the performance of similarity measures according to the proteins' classification as given by Structural Classification of Proteins (SCOP), release 1.69 [34]
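
To make the evaluation idea concrete, the sketch below shows one common way to score a dissimilarity measure against a reference classification with an ROC-style statistic. It is an illustration under assumptions, not the authors' evaluation code: the function name `roc_auc`, the pairwise-comparison formulation and the toy inputs are all hypothetical. Each sequence pair carries a dissimilarity value and a flag saying whether the two proteins share a class (for example, a SCOP class); the area under the ROC curve is then the probability that a randomly chosen same-class pair has a smaller dissimilarity than a randomly chosen different-class pair.

```python
def roc_auc(dissimilarities, same_class):
    """Area under the ROC curve for a dissimilarity measure.

    dissimilarities: one value per sequence pair (smaller = more similar).
    same_class: one bool per pair, True if both proteins share a class.
    Computed as the fraction of (same-class, different-class) pair
    combinations that are ranked correctly; ties count as half.
    """
    pos = [d for d, s in zip(dissimilarities, same_class) if s]
    neg = [d for d, s in zip(dissimilarities, same_class) if not s]
    if not pos or not neg:
        raise ValueError("need at least one same-class and one different-class pair")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0   # same-class pair correctly ranked as closer
            elif p == n:
                wins += 0.5   # tie: counted as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    scores = [0.10, 0.35, 0.40, 0.80]
    labels = [True, True, False, False]
    print(roc_auc(scores, labels))
```

A value near 1 means the measure separates same-class from different-class pairs well, while 0.5 corresponds to random ranking. The quadratic loop is chosen for readability; a rank-based computation would be preferable for large numbers of pairs.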

Summary

Introduction

Many statistical measures have been proposed that efficiently compare protein sequences in order to infer protein structure, function and evolutionary information. They share the idea of using k-word frequencies of protein sequences. Protein classification [14,15] aims to obtain a biologically meaningful partition. It has several advantages: when proteins are grouped into a family, the grouping can provide clues about the general features of the family and evolutionary evidence about the proteins, and the biological function of a new sequence can be inferred from its similarity to sequences of known function. Protein classification can also be used to facilitate the discovery of protein three-dimensional structure, which is very important for understanding proteins' functions. These computational methods rely heavily on the (dis)similarity measures defined among biological sequences.
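
As a minimal sketch of what "using k-word frequencies" can look like in practice, the Python below counts overlapping k-words in two sequences and compares the resulting frequency vectors with a cosine-based dissimilarity. The abstract names a measure cos.k but this page does not give its definition, so the formula here (one minus the cosine of the angle between the two frequency vectors) is an assumption, as are the function names and the example sequences.

```python
from collections import Counter
from math import sqrt


def kword_counts(seq, k):
    """Count overlapping words of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_frequencies(seq, k):
    """Relative frequency of each k-word; empty dict if seq is shorter than k."""
    counts = kword_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts.items()}


def cosine_dissimilarity(seq_a, seq_b, k):
    """1 - cosine of the angle between two k-word frequency vectors.

    Assumed form of a cosine-type measure, not taken from the paper.
    Returns 1.0 when either sequence has no k-words.
    """
    fa = kword_frequencies(seq_a, k)
    fb = kword_frequencies(seq_b, k)
    norm_a = sqrt(sum(v * v for v in fa.values()))
    norm_b = sqrt(sum(v * v for v in fb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    dot = sum(v * fb.get(word, 0.0) for word, v in fa.items())
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Short made-up peptides, for illustration only.
    a = "MKVLAAGIVGLLLA"
    b = "MKVLSAGIVGLALA"
    for k in (1, 2, 3):
        print(k, round(cosine_dissimilarity(a, b, k), 4))
```

Storing only the observed words in a dictionary avoids allocating all 20^k possible words, which matters as k grows. Sequences that share no k-word get a dissimilarity of 1, and identical sequences get 0 up to floating-point error.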


Journal: BMC Bioinformatics	Publication Date: Sep 23, 2008
Citations: 35	License type: CC BY 2.0
