Abstract
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), whereas the statistical models underlying other methods exploit the extensive repetitions in DNA sequences and therefore have a non-local character.
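To make the performance measure concrete, the following is a minimal sketch of a single order-k finite-context model, not the authors' implementation. It assumes a four-letter alphabet and a simple additive estimator with parameter alpha; the function names and the estimator are illustrative choices, and the paper's multiple competing models and inverted repeat handling are not reproduced here. Each position contributes the negative base-2 logarithm of its estimated probability, and the mean of that profile is the bits-per-base figure.

```python
from collections import defaultdict
from math import log2

ALPHABET = "ACGT"  # assumption: sequence contains only these four symbols


def information_profile(seq, k, alpha=1.0):
    """Return -log2 of the estimated probability of each symbol in seq.

    Counts are gathered causally, so the estimate at position i uses only
    symbols before i. The first k positions use a shorter context.
    alpha is an assumed additive smoothing parameter.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - k):i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-log2(p))
        ctx_counts[symbol] += 1  # update the model after coding the symbol
    return profile


def bits_per_base(seq, k, alpha=1.0):
    """Average of the information profile over the whole sequence."""
    profile = information_profile(seq, k, alpha)
    return sum(profile) / len(profile)
```

With no history, every symbol is estimated at probability 1/4, so the first position always costs 2 bits; highly repetitive input drives the average well below that.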
Highlights
Introduction
Since the work of Grumbach and Tahi [1], many contributions have been made in the area of DNA data compression (see, for example, [2,3,4,5,6,7,8,9,10] and for a recent review [11]).
These works explore the non-stationary nature of DNA sequence data, which are characterized by an alternation between regions of relatively high and low entropy.
We address a modeling question that we believe has not been satisfactorily answered before: How well can complete genomes be described using exclusively a combination of Markov models? We seek descriptions that are good in the sense of the minimum description length principle [26], i.e., that require as few bits as possible for representing the information.
Summary
Since the work of Grumbach and Tahi [1], many contributions have been made in the area of DNA data compression (see, for example, [2,3,4,5,6,7,8,9,10] and for a recent review [11]). According to the substitutional paradigm, repeated regions of the DNA sequence are represented by a pointer to a past occurrence of the repetition and by the length of the repeating sequence. Both exact and approximate repetitions have been explored, as well as their inverted complements. A large number of publications have addressed this topic, mainly with the aim of proposing techniques for gene finding (some examples can be found in [16,17,18,19,20,21,22]). Other applications, such as the detection of short inverted DNA segments [23], the assessment of the statistical significance of DNA patterns [24] or the identification of CpG islands [25], have relied on Markov models.
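The substitutional paradigm described above can be illustrated with a short sketch. This is a naive search written for clarity, not any of the cited methods: the function names, the `min_len` threshold, and the restriction to exact matches lying entirely in the past are assumptions made for the example. It returns a pointer, a length, and a flag saying whether the match is an inverted complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(segment):
    """Complement each base, then reverse the result."""
    return segment.translate(COMPLEMENT)[::-1]


def longest_past_match(seq, pos, min_len=4):
    """Find the longest exact repeat of the text starting at pos.

    Only occurrences lying entirely before pos are considered. Returns
    (pointer, length, inverted) or None if no match of at least min_len
    exists. Direct repeats are preferred over inverted ones at each length.
    """
    past = seq[:pos]
    best = None
    length = min_len
    while pos + length <= len(seq):
        target = seq[pos:pos + length]
        start = past.find(target)
        inverted = False
        if start < 0:
            start = past.find(inverted_complement(target))
            inverted = True
        if start < 0:
            break  # no longer match can exist once this length fails
        best = (start, length, inverted)
        length += 1
    return best
```

For instance, in `"AACCGT" + "TTTT" + "ACGGTT"` the final six bases are the inverted complement of the first six, so a coder following this scheme could replace them with a pointer to position 0, a length of 6, and an inversion flag.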