Abstract
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.
Highlights
Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data
The minimum description length (MDL) principle developed in the series of papers [6,7,8] is a well-founded, general framework for performing model class selection and other types of statistical inference
The model families used in our work are Bayesian networks of varying complexity
Summary
Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data. The model class selection criterion used here is the normalized maximum likelihood (NML) distribution. For multinomial (discrete) data, the definition of the NML distribution involves a normalizing sum over all the possible data samples of a fixed size. The logarithm of this sum is called the regret or parametric complexity, and it can be interpreted as a measure of the complexity of the model class. The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model class selection and related tasks. In addition to the single multinomial variable case, a more complex case involving a multidimensional model family, called naive Bayes, was discussed in [16]. Both these cases are reviewed in this paper.
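As a concrete illustration of the normalizing sum described above, the following is a minimal Python sketch (the function names are our own, not from the paper) that evaluates the NML normalizing constant C(n, K) for a single K-valued multinomial variable directly from its definition. Grouping the K^n possible data sequences by their sufficient statistics (the category counts) reduces the naive exponential sum to one term per count vector; the paper's reviewed algorithms are far more efficient still.

```python
from math import factorial, log, prod


def compositions(n, K):
    """Yield all K-tuples of nonnegative integers summing to n."""
    if K == 1:
        yield (n,)
        return
    for h in range(n + 1):
        for rest in compositions(n - h, K - 1):
            yield (h,) + rest


def multinomial_nml_norm(n, K):
    """Normalizing sum C(n, K) of the NML distribution for a single
    K-valued multinomial variable and sample size n: over all count
    vectors (h_1, ..., h_K) with h_1 + ... + h_K = n, sum the number
    of length-n sequences realizing those counts times the maximized
    likelihood of such a sequence.  Direct evaluation for illustration
    only; this is not the efficient algorithm reviewed in the paper."""
    total = 0.0
    for counts in compositions(n, K):
        # number of distinct length-n sequences with these category counts
        seqs = factorial(n)
        for h in counts:
            seqs //= factorial(h)
        # maximized likelihood prod_k (h_k / n)^{h_k} of any such sequence
        max_lik = prod((h / n) ** h for h in counts if h > 0)
        total += seqs * max_lik
    return total


# The regret (parametric complexity) is the logarithm of this sum:
regret = log(multinomial_nml_norm(10, 4))
```

The NML probability of an observed sample is then its maximized likelihood divided by C(n, K), so by construction these probabilities sum to one over all samples of size n.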