Abstract

Profile Hidden Markov Models (HMMs) are graphical models that can be used to produce finite-length sequences from a distribution. Although they were introduced for bioinformatics only 25 years ago (by Haussler et al., Hawaii International Conference on Systems Science, 1993), they are arguably the most commonly used statistical model in bioinformatics, with applications including protein structure and function prediction, classification of novel proteins into existing protein families and superfamilies, metagenomics, and multiple sequence alignment. The standard use of profile HMMs in bioinformatics has two steps: first, a profile HMM is built for a collection of molecular sequences (which may not be in a multiple sequence alignment), and then the profile HMM is used in some subsequent analysis of new molecular sequences. The construction of the profile is thus itself a statistical estimation problem, since any given set of sequences might fit more than one model well. Hence, a basic question about profile HMMs is whether they are statistically identifiable, which means that no two profile HMMs can produce the same distribution on finite-length sequences. Statistical identifiability is a fundamental property of any statistical model, and yet it is not known whether profile HMMs are statistically identifiable. In this paper, we report preliminary results towards characterizing the statistical identifiability of profile HMMs in one of the standard forms used in bioinformatics.

Highlights

  • Profile Hidden Markov Models (HMMs) are arguably the most common statistical models in bioinformatics

  • We address the question of statistical identifiability of profile Hidden Markov models, which in

  • The question we address in this paper is whether profile HMMs are statistically identifiable


Summary

INTRODUCTION

Profile Hidden Markov Models (HMMs) are arguably the most common statistical models in bioinformatics. In the standard form presented in [3], which is widely used in bioinformatics applications, each profile HMM has a single start state and a single end state, and every path through the model produces a string from Σ*. To the best of our knowledge, nothing has yet been established about the statistical identifiability of profile HMMs, although the question of identifiability of parameters in HMMs more generally has been addressed [25].
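As a rough illustration of the standard form described above, the following Python sketch builds a toy profile HMM with a single start state, a single end state, one match state, and no deletion states, and computes the probability that the model produces a given string. The state names (`S`, `I0`, `M1`, `I1`, `E`), the two-letter alphabet, and every probability value are hypothetical choices made for this example, not parameters taken from the paper.

```python
from itertools import product

ALPHABET = "ab"  # hypothetical two-letter alphabet

# Hypothetical transition probabilities: S = start, E = end,
# M1 = the single match state, I0/I1 = insertion states before/after M1.
TRANS = {
    "S":  {"I0": 0.2, "M1": 0.8},
    "I0": {"I0": 0.3, "M1": 0.7},
    "M1": {"I1": 0.4, "E": 0.6},
    "I1": {"I1": 0.5, "E": 0.5},
}

# Hypothetical emission probabilities; S and E are silent.
EMIT = {
    "I0": {"a": 0.5, "b": 0.5},
    "M1": {"a": 0.9, "b": 0.1},
    "I1": {"a": 0.5, "b": 0.5},
}


def string_probability(x):
    """Forward algorithm: sum over all start-to-end paths that emit x."""
    # fwd[state] = probability of emitting the prefix read so far
    # and currently sitting in `state`.
    fwd = {"S": 1.0}
    for ch in x:
        nxt = {}
        for state, p in fwd.items():
            for target, t in TRANS.get(state, {}).items():
                if target == "E":
                    continue  # E is silent and only closes a path
                nxt[target] = nxt.get(target, 0.0) + p * t * EMIT[target][ch]
        fwd = nxt
    # Close every partial path with a transition into the end state.
    return sum(p * TRANS.get(s, {}).get("E", 0.0) for s, p in fwd.items())


def total_mass(max_len):
    """Total probability of all strings of length 1..max_len."""
    return sum(
        string_probability("".join(w))
        for n in range(1, max_len + 1)
        for w in product(ALPHABET, repeat=n)
    )
```

In this setting, two parameter choices would fail to be distinguishable exactly when `string_probability` returns the same value for every string, which is the situation the identifiability question asks about. Calling `total_mass(12)` gives a quick sanity check that the toy model defines a distribution over finite-length strings, since the result should be close to 1.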

Preliminary Material and Notation
No Deletion Nodes
The Standard Profile HMM with One Match State
Estimating Parameters from Finite Data
UNKNOWN DISTRIBUTION AT THE INSERTION STATES
CONCLUSION