Abstract
Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM), which combines a Potts emission process with a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof-of-principle experiments.
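As an illustration of the importance sampling idea mentioned above, here is a minimal Python sketch. It is not the HPM algorithm: the sum over alignments is replaced by a sum over a four-state toy space, and the names `importance_sampling_sum`, `weights`, and `proposal` are hypothetical, chosen only for this example. The point is that a sum that cannot be computed directly can be estimated by drawing samples from a simpler proposal distribution and averaging the ratio of target weight to proposal probability.

```python
import random

def importance_sampling_sum(target_weight, proposal_sample, proposal_prob,
                            n_samples, rng):
    """Estimate sum_x target_weight(x) by averaging
    target_weight(x) / proposal_prob(x) over samples x drawn from the proposal."""
    total = 0.0
    for _ in range(n_samples):
        x = proposal_sample(rng)
        total += target_weight(x) / proposal_prob(x)
    return total / n_samples

# Toy stand-in for "all alignments": four discrete states (an assumption).
states = [0, 1, 2, 3]
weights = {0: 0.5, 1: 2.0, 2: 1.0, 3: 0.5}    # unnormalized target weights
proposal = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}   # simpler model we can sample from

exact = sum(weights.values())                  # tractable here, not in general

rng = random.Random(0)
estimate = importance_sampling_sum(
    target_weight=lambda x: weights[x],
    proposal_sample=lambda r: r.choices(
        states, weights=[proposal[s] for s in states])[0],
    proposal_prob=lambda x: proposal[x],
    n_samples=20000,
    rng=rng,
)
print(f"exact sum = {exact}, importance sampling estimate = {estimate:.3f}")
```

The estimate is most reliable when the proposal puts substantial probability wherever the target weight is large, which is presumably why a simpler probabilistic model of the same sequence family is a natural choice of proposal.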
Highlights
An important task in bioinformatics is determining whether a new sequence of unknown biological function is evolutionarily related, or homologous, to other known sequences or families of sequences
Recent advances in 3D protein structure prediction have used a class of statistical physics models called Potts models to infer pairwise correlation structure in multiple sequence alignments
We have extended Potts models to include a probability model of insertion and deletion so they can be applied to sequence alignment and remote homology search using a new model we call a hidden Potts model (HPM)
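To make the notion of a Potts model concrete, the following Python sketch scores an ungapped sequence as a sum of single-site terms plus pairwise coupling terms. Everything here is a toy assumption for illustration: the three-site model, the parameter values, the sign convention (higher score means more probable), and the names `potts_log_weight`, `h`, and `e` do not come from the paper.

```python
from itertools import combinations

ALPHABET = "ACGU"

def potts_log_weight(seq, h, e):
    """Unnormalized log-probability of an ungapped sequence:
    sum_i h[i][a_i] + sum_{i<j} e[(i, j)][(a_i, a_j)].
    Couplings missing from `e` are treated as zero."""
    score = sum(h[i][a] for i, a in enumerate(seq))
    for i, j in combinations(range(len(seq)), 2):
        score += e.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return score

# Toy 3-site model (hypothetical values): site 1 prefers A,
# and sites 0 and 2 are coupled to favor Watson-Crick pairs.
h = [{a: 0.0 for a in ALPHABET} for _ in range(3)]
h[1]["A"] = 1.0
e = {(0, 2): {("G", "C"): 2.0, ("C", "G"): 2.0,
              ("A", "U"): 2.0, ("U", "A"): 2.0}}

paired = potts_log_weight("GAC", h, e)     # sites 0 and 2 form a G-C pair
unpaired = potts_log_weight("GAA", h, e)   # sites 0 and 2 do not pair
print(paired, unpaired)
```

Because the coupling table is indexed by arbitrary site pairs rather than only nested ones, a model of this form can in principle reward any pairwise correlation. The scoring function assumes a fixed-length sequence, so it says nothing about insertions and deletions, which is the gap the hidden Potts model is meant to fill.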
Summary
An important task in bioinformatics is determining whether a new sequence of unknown biological function is evolutionarily related, or homologous, to other known sequences or families of sequences. Critical to the concept of homology is alignment: homology tools create multiple sequence alignments (MSAs) in which evolutionarily related positions are aligned in columns by inferring patterns of sequence conservation induced by complex evolutionary constraints maintaining the structure and function of the sequence [1]. One possible way to improve the sensitivity of homology search and alignment is to develop new methods that successfully capture patterns of residue correlation induced by 3D structural constraints. State-of-the-art homology search methods do not model certain important elements of structure-induced conservation. Methods such as BLAST and HMMER, the latter of which uses profile hidden Markov models (pHMMs), align and score sequences using primary sequence conservation alone [3,4,5]. For RNA, Infernal does model base-pairing correlations, but it is limited to nested, disjoint pairs of nucleotides, meaning it cannot capture complicated 3D RNA structural elements like pseudoknots and base triples, let alone complex correlation structure in protein MSAs.