Abstract

Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.
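As a brief reminder of the object the abstract refers to (the notation below is a conventional rendering, not taken verbatim from the paper), the NML distribution for a model class $\mathcal{M}$ and a discrete data sample $x^n$ of size $n$ is the maximized likelihood normalized over all possible samples of the same size:

$$
P_{\mathrm{NML}}(x^n \mid \mathcal{M}) \;=\; \frac{P\big(x^n \mid \hat{\theta}(x^n), \mathcal{M}\big)}{\sum_{y^n} P\big(y^n \mid \hat{\theta}(y^n), \mathcal{M}\big)},
$$

where $\hat{\theta}(\cdot)$ denotes the maximum likelihood parameters for the given sample. The denominator is the normalizing sum mentioned in the abstract; its logarithm is the regret (parametric complexity), and the naive evaluation of this sum over all $y^n$ is what requires exponential time.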

Highlights

  • Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data

  • The minimum description length (MDL) principle developed in the series of papers [6,7,8] is a well-founded, general framework for performing model class selection and other types of statistical inference

  • The model families used in our work are Bayesian networks of varying complexity

Summary

INTRODUCTION

Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data. The NML distribution offers a theoretically well-founded criterion for this task. For multinomial (discrete) data, its definition involves a normalizing sum over all the possible data samples of a fixed size. The logarithm of this sum is called the regret or parametric complexity, and it can be interpreted as the complexity of the model class. The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model class selection and related tasks. A more complex case involving a multidimensional model family, called naive Bayes, was discussed in [16]. Both the multinomial and the naive Bayes cases are reviewed in this paper.
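The quadratic- and linear-time algorithms listed in the outline below concern exactly this normalizing sum. As an illustration only (function names are ours, and we assume the standard Kontkanen–Myllymäki recurrence C(n, K) = C(n, K-1) + (n / (K-2)) · C(n, K-2) for the multinomial normalizing sum), the following sketch contrasts the naive exponential-style summation with the linear-time recurrence:

```python
from math import comb

def nml_normalizer_brute(K, n):
    """Naive evaluation of the multinomial NML normalizing sum:
    sum over all count vectors (h_1, ..., h_K) with sum n of
    multinomial(n; h) * prod_k (h_k / n)^{h_k}.  Feasible only for tiny K, n."""
    def rec(k, remaining):
        if k == K - 1:  # last category gets all remaining counts
            h = remaining
            return (h / n) ** h  # Python: 0.0 ** 0 == 1.0, matching 0^0 := 1
        total = 0.0
        for h in range(remaining + 1):
            # comb(remaining, h) accumulates the multinomial coefficient
            total += comb(remaining, h) * (h / n) ** h * rec(k + 1, remaining - h)
        return total
    return rec(0, n)

def nml_normalizer_linear(K, n):
    """Linear-time evaluation via the assumed recurrence
    C(n, K) = C(n, K-1) + (n / (K-2)) * C(n, K-2),
    with C(n, 1) = 1 and C(n, 2) computed directly in O(n)."""
    c1 = 1.0
    c2 = sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
             for h in range(n + 1))
    if K == 1:
        return c1
    prev, curr = c1, c2
    for k in range(3, K + 1):
        prev, curr = curr, curr + (n / (k - 2)) * prev
    return curr
```

For example, for K = 3 categories and n = 2 observations both routines yield 4.5: three single-category samples each contribute a maximized likelihood of 1, and the six two-category sequences contribute 1/4 each. In practice one would work with the logarithm of these quantities to avoid overflow for large n.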

PROPERTIES OF THE MDL PRINCIPLE AND THE NML MODEL
Model classes and families
The NML distribution
NML FOR MULTINOMIAL MODELS
The model family
The quadratic-time algorithm
The linear-time algorithm
Approximating the multinomial NML
NML FOR THE NAIVE BAYES MODEL
NML FOR BAYESIAN FORESTS
The algorithm
Leaves
Inner nodes
Component tree roots
1: Count all frequencies f_{ikl} and f_{il} from the data x^n
CONCLUSION