Abstract

A classic approach for learning Bayesian networks from data is to identify a maximum a posteriori (MAP) network structure. In the case of discrete Bayesian networks, MAP networks are selected by maximising one of several possible Bayesian–Dirichlet (BD) scores; the most famous is the Bayesian–Dirichlet equivalent uniform (BDeu) score from Heckerman et al. (Mach Learn 20(3):197–243, 1995). The key properties of BDeu arise from its uniform prior over the parameters of each local distribution in the network, which makes structure learning computationally efficient, removes the need to elicit prior knowledge from experts, and ensures score equivalence. In this paper we will review the derivation and the properties of BD scores, and of BDeu in particular, and we will link them to the corresponding entropy estimates to study them from an information theoretic perspective. To this end, we will work in the context of the foundational work of Giffin and Caticha (Proceedings of the 27th international workshop on Bayesian inference and maximum entropy methods in science and engineering, pp 74–84, 2007), who showed that Bayesian inference can be framed as a particular case of the maximum relative entropy principle. We will use this connection to show that BDeu should not be used for structure learning from sparse data, since it violates the maximum relative entropy principle; and that it is also problematic from a more classic Bayesian model selection perspective, because it produces Bayes factors that are sensitive to the value of its only hyperparameter. Using a large simulation study, we found in our previous work (Scutari in J Mach Learn Res (Proc Track PGM 2016) 52:438–448, 2016) that the Bayesian–Dirichlet sparse (BDs) score seems to provide better accuracy in structure learning; in this paper we further show that BDs does not suffer from the issues above, and we recommend using it for sparse data instead of BDeu.
Finally, we will show that these issues are in fact different aspects of the same problem and a consequence of the distributional assumptions of the prior.

Highlights

  • Bayesian networks (BNs; Pearl 1988; Koller and Friedman 2009) are probabilistic graphical models based on a directed acyclic graph (DAG) G whose nodes are associated with a set of random variables X = {X1, ... , XN} following some distribution P(X). (The two are referred to interchangeably.) Formally, G is defined as an independency map of P(X) such that: A ⟂G B | C ⟹ A ⟂P B | C, where A, B and C are disjoint subsets of X

  • In this paper we focus on score-based structure learning in a Bayesian framework, in which we aim to identify a maximum a posteriori (MAP) DAG that directly maximises the posterior probability P(G | D) of the DAG G given the data D

  • We find that Bayesian–Dirichlet equivalent uniform (BDeu) is not a reliable network score when applied to sparse data because it can select overly complex networks over simpler ones given the same information in the prior and in the data; and that in the process it violates the maximum relative entropy principle


Summary

Introduction and background

Common choices for the network score include the Bayesian Information Criterion (BIC) and the marginal likelihood P(G | D) itself; for an overview see again Scutari and Denis (2014). We will cover both in more detail for discrete BNs in Sect. As for parameter learning, the parameters ΘXi can be estimated independently for each node following (1), since its parents are assumed to be known from structure learning. Both maximum likelihood and Bayesian posterior estimators are in common use, with the latter being preferred for their smoothness and superior predictive power (Koller and Friedman 2009). We will show that the uniform prior distribution over each ΘXi that underlies BDeu can be problematic from a Bayesian perspective, resulting in wildly different Bayes factors (and structure learning outcomes) depending on the value of its only hyperparameter, the imaginary sample size.
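To make the role of the imaginary sample size concrete, the BDeu marginal likelihood of a single node given its parents has a closed form in terms of the observed counts (Heckerman et al. 1995). The following is a minimal Python sketch, not taken from any particular library; the function name and the counts layout are ours:

```python
from math import lgamma

def bdeu_local_score(counts, iss=1.0):
    """Log BDeu marginal likelihood of one node given its parents.

    counts[j][k] holds n_ijk, the number of observations in which the
    node takes its k-th value while its parents are in their j-th
    configuration; iss is the imaginary sample size, the only
    hyperparameter of BDeu, spread uniformly over all cells.
    """
    q = len(counts)       # number of parent configurations
    r = len(counts[0])    # number of states of the node
    a_j = iss / q         # prior mass per parent configuration
    a_jk = iss / (q * r)  # prior mass per cell (uniform Dirichlet)
    score = 0.0
    for row in counts:
        n_j = sum(row)
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for n_jk in row:
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score

# A binary node with no parents (a single parent configuration),
# observed once in each state with iss = 1: the marginal likelihood
# is exactly 1/8, so the log score is log(1/8).
print(bdeu_local_score([[1, 1]], iss=1.0))  # ≈ -2.0794
```

Summing this local score over all nodes gives the log marginal likelihood of the DAG; re-running the example with different `iss` values shows directly how the score, and hence the Bayes factors built from it, shifts with the imaginary sample size.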

Bayesian–Dirichlet marginal likelihoods
BDeu and Bayesian model selection
Bayesian structure learning and entropy
The posterior marginal entropy
BDeu and the principle of maximum entropy
Compliance with ethical standards