Abstract

Discrete Markovian models can be used to characterize patterns in sequences of values and have many applications in biological sequence analysis, including gene prediction, CpG island detection, alignment, and protein profiling. We present ToPS, a computational framework that can be used to implement different applications in bioinformatics analysis by combining eight kinds of models: (i) independent and identically distributed process; (ii) variable-length Markov chain; (iii) inhomogeneous Markov chain; (iv) hidden Markov model; (v) profile hidden Markov model; (vi) pair hidden Markov model; (vii) generalized hidden Markov model; and (viii) similarity-based sequence weighting. The framework includes functionality for training, simulation, and decoding of the models. Additionally, it provides two methods to help with parameter setting: the Akaike and Bayesian information criteria (AIC and BIC). The models can be used stand-alone, combined in Bayesian classifiers, or included in more complex, multi-model probabilistic architectures using generalized hidden Markov models (GHMMs). In particular, the framework provides a novel, flexible implementation of decoding in GHMMs that detects when the architecture can be traversed efficiently.
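To illustrate how AIC and BIC can guide parameter setting, the following Python sketch selects the order of a fixed-order Markov chain over DNA. It is not ToPS code and does not reflect how ToPS implements these criteria. The function names (`log_likelihood`, `num_parameters`, `information_criteria`, `select_order`) are hypothetical, and the parameter count assumes a full fixed-order chain over a four-letter alphabet.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: DNA sequences only


def log_likelihood(seq, order, start):
    """Maximised log-likelihood of an order-`order` Markov chain.

    Symbols are scored from position `start` onward so that models of
    different orders are compared on exactly the same observations.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        ctx = seq[i - order:i]  # empty string when order == 0
        context_counts[ctx] += 1
        transition_counts[(ctx, seq[i])] += 1
    # With maximum-likelihood estimates, each transition contributes
    # count * log(count / context_count).
    return sum(
        n * math.log(n / context_counts[ctx])
        for (ctx, _sym), n in transition_counts.items()
    )


def num_parameters(order, alphabet_size=len(ALPHABET)):
    """Free parameters: one distribution (size - 1 free values) per context."""
    return (alphabet_size ** order) * (alphabet_size - 1)


def information_criteria(seq, order, start):
    """Return (AIC, BIC) for the given order; lower is better."""
    ll = log_likelihood(seq, order, start)
    k = num_parameters(order)
    n = len(seq) - start  # number of scored observations
    aic = 2 * k - 2 * ll
    bic = k * math.log(n) - 2 * ll
    return aic, bic


def select_order(seq, max_order):
    """Return the orders in 0..max_order that minimise AIC and BIC."""
    scores = {
        order: information_criteria(seq, order, start=max_order)
        for order in range(max_order + 1)
    }
    best_aic = min(scores, key=lambda o: scores[o][0])
    best_bic = min(scores, key=lambda o: scores[o][1])
    return best_aic, best_bic


if __name__ == "__main__":
    # A strictly periodic toy sequence: each symbol is fully determined
    # by the previous one, so higher orders add parameters but no fit.
    toy = "ACGT" * 50
    print(select_order(toy, max_order=3))
```

Both criteria trade goodness of fit against the number of parameters. BIC penalises each parameter by the logarithm of the sample size, so on longer sequences it tends to choose simpler models than AIC.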

Highlights

  • Markov models of nucleic acids and proteins are widely used in bioinformatics

  • In the Variable Length Markov Chain, the user has to set a parameter that controls the pruning of the probabilistic suffix tree

  • Scripts, configuration files and sequence data to reproduce the experiments are available through the ToPS homepage


Summary

Introduction

Examples of applications include ab initio gene prediction [1], CpG island detection [2], protein family characterization [3], and sequence alignment [4]. These models are often hard-coded in the analysis software, which means that well-known algorithms are implemented over and over again. A system providing a wide range of these models is important to allow researchers to quickly select the most appropriate model for analyzing sequences from different problem domains. In some cases, such as gene prediction, characterizing the family of sequences may involve integrating several probabilistic models in a single architecture. Another alternative is a general-purpose system that can implement different models, such as gHMM [8], HTK [9], HMMoC [10], HMMConverter [11], N-SCAN [12], and Tigrscan [13] (known as Genezilla).

Methods
Results
Conclusion
