Abstract

Statistical methods have been developed for finding local patterns, also called motifs, in multiple protein sequences. The aligned segments may imply functional or structural core regions. However, the existing methods often have difficulties in aligning multiple proteins when sequence residue identities are low (e.g., less than 25%). In this article, we develop a Bayesian model and Markov chain Monte Carlo (MCMC) methods for identifying subtle motifs in protein sequences. Specifically, a motif is defined not only in terms of specific sites characterized by amino acid frequency vectors, but also as a combination of secondary characteristics such as hydrophobicity, polarity, etc. Markov chain Monte Carlo methods are proposed to search for a motif pattern with high posterior probability under the new model. A special MCMC algorithm is developed, involving transitions between state spaces of different dimensions. The proposed methods were supported by a simulated study. It was then tested by two real datasets, including a group of helix-turn-helix proteins, and one set from the CATH Protein Structure Classification Database. Statistical comparisons showed that the new approach worked better than a typical Gibbs sampling approach which is based only on an amino acid model.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call