Abstract

In this paper, we study the problem of constructing models for a stationary stochastic process {y_t} assuming values in a finite set M := {1, ..., m}. It is assumed that only a finite-length sample path of the process is known, and not the full statistics of the process. Two kinds of problems are studied, namely modelling for prediction and modelling for classification. For the prediction problem, it is shown in a companion paper that the well-known approach of modelling the given process as a multi-step Markov process is in fact the only solution satisfying certain nonnegativity constraints. In the present paper, accuracy and confidence bounds are derived for the parameters of this multi-step Markov model. So far as the author is aware, such bounds have not been published previously. For the classification problem, it is assumed that two distinct sets of sample paths of two separate stochastic processes are available; call them {u_1, ..., u_r} and {v_1, ..., v_s}. The objective here is to develop not one but two models, called C and NC respectively, such that the strings u_i have much larger likelihoods under the model C than under the model NC, and the opposite is true for the strings v_j. A new string w is then classified as C or NC according to whether its likelihood is larger under the model C or under the model NC. For the classification problem, we develop a new algorithm called the 4M (Mixed Memory Markov Model) algorithm, which is an improvement over variable-length Markov models. We then apply the 4M algorithm to the problem of finding genes in the genome. The performance of the 4M algorithm is compared against that of the popular Glimmer algorithm. In most of the test cases studied, the 4M algorithm correctly classifies both coding and non-coding regions more than 90% of the time. Moreover, the accuracy of the 4M algorithm compares well with that of Glimmer. At the same time, the 4M algorithm is amenable to statistical analysis.
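
The two-model classification rule described above can be illustrated with a minimal sketch. This is not the 4M algorithm, whose details are not given in the abstract; it substitutes a plain fixed-order (k-step) Markov model for each class. The function names `fit_markov`, `log_likelihood` and `classify`, the add-one smoothing, and the toy training strings are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def fit_markov(strings, k, m):
    """Estimate a k-step Markov model over symbols 1..m from sample paths.

    Add-one (Laplace) smoothing is an assumption of this sketch, used so
    that unseen transitions do not receive zero probability.
    Returns a function prob(context, symbol).
    """
    counts = defaultdict(lambda: [0] * m)
    for s in strings:
        for t in range(k, len(s)):
            context = tuple(s[t - k:t])      # the k preceding symbols
            counts[context][s[t] - 1] += 1   # symbols are 1-based

    def prob(context, symbol):
        row = counts.get(context)
        if row is None:                      # context never seen: uniform
            return 1.0 / m
        return (row[symbol - 1] + 1) / (sum(row) + m)

    return prob


def log_likelihood(prob, s, k):
    """Log-likelihood of string s, conditioned on its first k symbols."""
    return sum(math.log(prob(tuple(s[t - k:t]), s[t]))
               for t in range(k, len(s)))


def classify(w, prob_c, prob_nc, k):
    """Assign w to 'C' or 'NC', whichever model gives the larger likelihood."""
    ll_c = log_likelihood(prob_c, w, k)
    ll_nc = log_likelihood(prob_nc, w, k)
    return "C" if ll_c > ll_nc else "NC"


# Toy data (hypothetical): one training string per class, alphabet {1, 2}.
u_strings = [[1, 2, 1, 2, 1, 2, 1, 2]]
v_strings = [[1, 1, 2, 2, 1, 1, 2, 2]]
prob_c = fit_markov(u_strings, k=1, m=2)
prob_nc = fit_markov(v_strings, k=1, m=2)

label_alternating = classify([1, 2, 1, 2], prob_c, prob_nc, k=1)
label_paired = classify([1, 1, 2, 2], prob_c, prob_nc, k=1)
```

In this toy setup the alternating string is closer to the training data for C and the paired string is closer to the training data for NC, so the two labels differ. For gene finding, the symbols would be nucleotides mapped to 1..4, which is likewise an assumption about encoding rather than something stated in the abstract.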
