Information About the Moments or the Likelihood Model Parameters? A Chicken and Egg Problem
ABSTRACT: This article compares the information content of a sample for two competing Bayesian approaches. One approach follows Dennis Lindley's Bayesian standpoint, where one begins by formulating a prior for a parameter related to the problem in question and incorporates a likelihood to transition to a posterior. This contrasts with the usual Bayesian approach, where one starts with a likelihood model, formulates a prior distribution for its parameters, and derives the corresponding posterior. In both cases, the sample information content is measured using the difference between the prior and posterior entropies. We investigate this contrast in the context of learning about the moments of a variable. The maximum entropy principle is used to construct the likelihood model consistent with the given moment parameters. This likelihood model is then combined with the prior information on the parameters to derive the posterior. The model parameters are the Lagrange multipliers for the moment constraints. A prior for the moments induces a prior for the model parameters; however, the data provides differing amounts of information about them. The results obtained for several problems show that the information content using the two formulations can differ significantly. Additional information measures are derived to assess the effects of operating environments on the lifetimes of system components.
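The contrast described in the abstract can be illustrated numerically. The sketch below is my own minimal example, not code from the article: for a positive variable with a single mean constraint, the maximum-entropy likelihood is exponential, the Lagrange multiplier is the rate λ = 1/μ, a conjugate Gamma prior on λ induces an inverse-Gamma prior on the moment μ, and the sample information is measured, as in the abstract, by the drop from prior to posterior differential entropy under each parameterization. The hyperparameters and the synthetic sample are assumptions chosen for illustration.

```python
# Minimal sketch (my own illustration, not the article's code): information = prior entropy minus
# posterior entropy, compared for the moment (mean mu) and the Lagrange-multiplier (rate lam)
# parameterizations of the maximum-entropy model for a positive variable with a mean constraint.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a0, b0 = 2.0, 2.0                        # assumed Gamma(a0, rate b0) prior on lam = 1/mu
x = rng.exponential(scale=1.5, size=20)  # synthetic sample
a1, b1 = a0 + x.size, b0 + x.sum()       # conjugate Gamma posterior on lam

# Entropy drop in the Lagrange-multiplier (rate) parameterization
h_prior_lam = stats.gamma(a0, scale=1.0 / b0).entropy()
h_post_lam = stats.gamma(a1, scale=1.0 / b1).entropy()

# The same prior and posterior induce inverse-Gamma laws on the moment mu = 1/lam
h_prior_mu = stats.invgamma(a0, scale=b0).entropy()
h_post_mu = stats.invgamma(a1, scale=b1).entropy()

print("information about lam:", h_prior_lam - h_post_lam)
print("information about mu :", h_prior_mu - h_post_mu)
```

Because differential entropy is not invariant under reparameterization, the two prior-to-posterior entropy drops generally disagree, which is exactly the chicken-and-egg contrast the article examines.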
- Research Article
131
- 10.1139/cgj-2013-0004
- Jul 1, 2013
- Canadian Geotechnical Journal
This paper develops Bayesian approaches for underground soil stratum identification and soil classification using cone penetration tests (CPTs). The uncertainty in the CPT-based soil classification using the Robertson chart is modeled explicitly in the Bayesian approaches, and the probability that the soil belongs to one of the nine soil types in the Robertson chart based on a set of CPT data is formulated using the maximum entropy principle. The proposed Bayesian approaches contain two major components: a Bayesian model class selection approach to identify the most probable number of underground soil layers and a Bayesian system identification approach to simultaneously estimate the most probable layer thicknesses and classify the soil types. Equations are derived for the Bayesian approaches, and the proposed approaches are illustrated using a real-life CPT performed at the National Geotechnical Experimentation Site (NGES) at Texas A&M University, USA. It has been shown that the proposed approaches properly identify the underground soil stratification and classify the soil type of each layer. In addition, as the number of model classes increases, the Bayesian model class selection approach identifies the soil layers progressively, starting from the statistically most significant boundary and gradually zooming into less significant ones with improved resolution. Furthermore, it is found that the evolution of the identified soil strata as the model class increases provides additional valuable information for assisting in the interpretation of CPT data in a rational and transparent manner.
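The layer-counting step can be mimicked with a generic sketch (assumed data and scoring, not the authors' formulation): the evidence for a model class with k layers is approximated by a BIC score for the best piecewise-constant fit to a one-dimensional sounding profile, and the most probable number of layers is the k with the highest score.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): choose the most probable number of
# soil layers k for a 1-D profile by approximating log-evidence with BIC over piecewise-constant fits.
import itertools
import numpy as np

rng = np.random.default_rng(1)
depth_profile = np.concatenate([rng.normal(1.0, 0.2, 12),   # synthetic 3-layer "cone resistance" log
                                rng.normal(2.5, 0.2, 10),
                                rng.normal(1.6, 0.2, 8)])
n = depth_profile.size

def best_fit_loglik(k):
    """Max Gaussian log-likelihood over all placements of k-1 internal boundaries (brute force)."""
    best = -np.inf
    for cuts in itertools.combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        resid = np.concatenate([depth_profile[a:b] - depth_profile[a:b].mean()
                                for a, b in zip(bounds[:-1], bounds[1:])])
        sigma2 = max(resid.var(), 1e-12)
        best = max(best, -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0))
    return best

for k in range(1, 5):
    n_params = 2 * k  # k layer means + (k-1) boundaries + 1 variance
    bic = best_fit_loglik(k) - 0.5 * n_params * np.log(n)
    print(f"k={k} layers: approximate log-evidence (BIC) = {bic:.1f}")
```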
- Research Article
4
- 10.1108/03684920710741143
- Feb 20, 2007
- Kybernetes
Purpose: In many problems involving decision-making under uncertainty, the underlying probability model is unknown but partial information is available. In some approaches to this problem, the available prior information is used to define an appropriate probability model for the system uncertainty through a probability density function. When the prior information is available as a finite sequence of moments of the unknown probability density function (PDF) defining the appropriate probability model for the uncertain system, the maximum entropy (ME) method derives a PDF from an exponential family to define an approximate model. This paper aims to investigate some optimality properties of the ME estimates. Design/methodology/approach: For n > m, the exact model is best approximated by one of an infinite number of unknown PDFs from an n-parameter exponential family. The upper bound of the divergence distance between any PDF from this family and the m-parameter exponential family PDF defined by the ME method is derived. A measure of adequacy of the model defined by the ME method is thus provided. Findings: These results may be used to establish confidence intervals on the estimate of a function of the random variable when the ME approach is employed. Additionally, it is shown that, when working with large samples of independent observations, a probability density function (PDF) can be defined from an exponential family to model the uncertainty of the underlying system with measurable accuracy. Finally, a relationship with maximum likelihood estimation for this case is established. Practical implications: The so-called known-moments problem addressed in this paper has a variety of applications in learning, blind equalization and neural networks. Originality/value: An upper bound for the error in approximating an unknown density function f(x) by its ME estimate, obtained as a PDF p(x, α) from an m-parameter exponential family based on m moment constraints, is derived. The error bound helps decide whether the number of moment constraints is adequate for modeling the uncertainty in the system under study. In turn, this allows one to establish confidence intervals on an estimate of some function of the random variable X, given the known moments. It is also shown how, when working with a large sample of independent observations instead of precisely known moment constraints, a density from an exponential family can be defined to model the uncertainty of the underlying system with measurable accuracy. In this case, a relationship to ML estimation is established.
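To make the known-moments construction concrete, here is a minimal numerical sketch with an assumed support and assumed target moments (not an example from the paper): the ME density on [0, 1] matching the first two moments has the exponential-family form p(x) ∝ exp(λ₁x + λ₂x²), and the Lagrange multipliers are found by solving the moment-matching equations.

```python
# Minimal sketch (assumed support and target moments): maximum-entropy density on [0, 1]
# matching E[X] = m1 and E[X^2] = m2 via its exponential-family form p(x) ∝ exp(lam1*x + lam2*x**2);
# the Lagrange multipliers solve the moment-matching equations.
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import fsolve

m_target = np.array([0.30, 0.15])      # hypothetical target moments
xs = np.linspace(0.0, 1.0, 2001)       # quadrature grid on the support

def model_moments(lam):
    w = np.exp(lam[0] * xs + lam[1] * xs**2)
    z = trapezoid(w, xs)               # normalizing constant
    return np.array([trapezoid(xs * w, xs), trapezoid(xs**2 * w, xs)]) / z

lam_hat = fsolve(lambda lam: model_moments(lam) - m_target, x0=np.zeros(2))
print("Lagrange multipliers:", lam_hat)
print("achieved moments    :", model_moments(lam_hat))
```

With m constraints the same construction yields the m-parameter exponential family whose adequacy the divergence bound discussed above quantifies.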
- Conference Article
- 10.21437/interspeech.2011-29
- Aug 27, 2011
This paper investigates a multi-speaker modeling technique with shared prior distributions and model structures for Bayesian speech synthesis. The quality of synthesized speech is improved by selecting appropriate model structures in HMM-based speech synthesis. The Bayesian approach is known to work for such model selection. However, the result is strongly affected by the prior distributions of the model parameters. Therefore, determination of prior distributions and selection of model structures should be performed simultaneously. This paper investigates prior distributions and model structures in the situation where training data of multiple speakers are available. Prior distributions and model structures which represent acoustic features common to all speakers can be obtained by sharing them among multiple speaker-dependent models. Index Terms: speech synthesis, Bayesian approach, prior distribution, context clustering, multi-speaker modeling. A statistical parametric speech synthesis system based on hidden Markov models (HMMs) was recently developed. In HMM-based speech synthesis, the spectrum, excitation, and duration of speech are simultaneously modeled with HMMs, and speech parameter sequences are generated from the HMMs themselves [1]. The maximum likelihood (ML) criterion has typically been used for training HMMs and generating speech parameters. The ML criterion guarantees that the ML estimates approach the true values of the parameters. However, since the ML criterion produces a point estimate of the model parameters, its estimation accuracy may degrade when the amount of training data is insufficient. In the Bayesian approach, all variables introduced when the models are parameterized, such as model parameters and latent variables, are treated as random variables, and their posterior distributions are obtained by Bayes' theorem. The Bayesian approach can generally construct a more robust model than the ML approach by estimating posterior distributions. Recently, Bayesian speech synthesis has been proposed as a Bayesian framework for statistical parametric speech synthesis (e.g., HMM-based speech synthesis), and it shows good performance [2]. In Bayesian speech synthesis, all processes for constructing the system can be derived from a single predictive distribution that directly represents the problem of speech synthesis. The quality of synthesized speech is improved by selecting appropriate model structures in HMM-based speech synthesis. Although the Bayesian approach is known to work for such model selection, the results are strongly affected by the prior distributions of the model parameters. Therefore, in Bayesian speech synthesis, determination of prior distributions and selection of model structures should be performed simultaneously. To overcome this problem, we have proposed Bayesian context clustering using cross validation [3]. In this method, prior distributions are determined by using a part of the training data, and model structures are evaluated by using the determined prior distributions based on cross validation. In this paper, we investigate prior distributions and model structures in the situation where training data of multiple speakers are available.
- Research Article
26
- 10.1145/2297456.2297460
- Jul 1, 2012
- ACM Transactions on Knowledge Discovery from Data
We present an extension to Jaynes’ maximum entropy principle that incorporates latent variables. The principle of latent maximum entropy we propose is different from both Jaynes’ maximum entropy principle and maximum likelihood estimation, but can yield better estimates in the presence of hidden variables and limited training data. We first show that solving for a latent maximum entropy model poses a hard nonlinear constrained optimization problem in general. However, we then show that feasible solutions to this problem can be obtained efficiently for the special case of log-linear models, which forms the basis for an efficient approximation to the latent maximum entropy principle. We derive an algorithm that combines expectation-maximization with iterative scaling to produce feasible log-linear solutions. This algorithm can be interpreted as an alternating minimization algorithm in the information divergence, and reveals an intimate connection between the latent maximum entropy and maximum likelihood principles. To select a final model, we generate a series of feasible candidates, calculate the entropy of each, and choose the model that attains the highest entropy. Our experimental results show that estimation based on the latent maximum entropy principle generally gives better results than maximum likelihood when estimating latent variable models on small observed data samples.
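The selection rule at the end of the abstract (generate feasible candidates, compute each one's entropy, keep the highest) is easy to illustrate on a toy latent-variable model. The sketch below is my own: a two-class mixture of Bernoullis fitted by plain EM (rather than EM with iterative scaling) on hypothetical binary data, with random restarts supplying the candidates.

```python
# Minimal sketch (toy latent-class model, my own illustration): fit several candidate models by EM,
# then follow the selection rule described above: compute the entropy of each fitted marginal
# distribution over the observables and keep the candidate with the highest entropy.
import itertools
import numpy as np

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 3))            # hypothetical binary data, 3 features
space = np.array(list(itertools.product([0, 1], repeat=3)))

def em_fit(X, seed, n_iter=200):
    local_rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                    # mixing weights of 2 latent classes
    theta = local_rng.uniform(0.3, 0.7, size=(2, 3))  # Bernoulli parameters per class/feature
    for _ in range(n_iter):
        # E-step: posterior responsibilities of the latent class for each data point
        log_r = np.log(pi) + (X[:, None, :] * np.log(theta) +
                              (1 - X[:, None, :]) * np.log(1 - theta)).sum(axis=2)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        pi = np.clip(r.mean(axis=0), 1e-9, None)
        pi = pi / pi.sum()
        theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return pi, theta

def marginal_entropy(pi, theta):
    p = (pi[None, :] * np.prod(space[:, None, :] * theta[None, :, :] +
                               (1 - space[:, None, :]) * (1 - theta[None, :, :]), axis=2)).sum(axis=1)
    return -(p * np.log(p)).sum()

candidates = [em_fit(X, seed) for seed in range(5)]          # several feasible candidates
best = max(candidates, key=lambda c: marginal_entropy(*c))   # highest-entropy candidate wins
print("selected mixing weights:", best[0])
```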
- Research Article
5
- 10.1021/acs.jctc.2c01090
- Apr 6, 2023
- Journal of Chemical Theory and Computation
Maximum entropy methods (MEMs) determine posterior distributions by combining experimental data with prior information. MEMs are frequently used to reconstruct conformational ensembles of molecular systems from experimental information and initial molecular ensembles. We performed time-resolved Förster resonance energy transfer (FRET) experiments to probe the interdye distance distributions of the lipase-specific foldase Lif in the apo state, which likely has highly flexible, disordered, and/or ordered structural elements. Distance distributions estimated from ensembles of molecular dynamics (MD) simulations serve as prior information, and FRET experiments, analyzed within a Bayesian framework to recover distance distributions, are used for optimization. We tested priors obtained by MD with different force fields (FFs) tailored to ordered (FF99SB, FF14SB, and FF19SB) and disordered proteins (IDPSFF and FF99SBdisp). We obtained five substantially different posterior ensembles. Because the noise in our FRET experiments is characterized by photon counting statistics, MEM can, for a validated dye model, quantify consistencies between experiment and prior or posterior ensembles. However, posterior populations of conformations are uncorrelated with structural similarities for individual structures selected from different prior ensembles. Therefore, we assessed MEM by simulating varying priors in synthetic experiments with known target ensembles. We found that (i) the prior and experimental information must be carefully balanced for optimal posterior ensembles to minimize perturbations of populations by overfitting and (ii) only ensemble-integrated quantities like inter-residue distance distributions or density maps can be reliably obtained, but not ensembles of atomistic structures. This is because MEM optimizes ensembles but not individual structures. This result for a highly flexible system suggests that structurally varying priors calculated from varying prior ensembles, e.g., generated with different FFs, may serve as an ad hoc estimate of MEM reconstruction robustness.
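The core maximum-entropy update can be shown generically. The sketch below is a standard ensemble-reweighting example under assumed data, not the authors' Bayesian FRET pipeline: posterior weights minimize the relative entropy to the prior weights subject to reproducing one experimental average, which gives an exponential tilting with a single Lagrange multiplier.

```python
# Generic maximum-entropy reweighting sketch (not the paper's pipeline): given prior ensemble
# weights w0 and per-conformer values f of an observable, find posterior weights w that match an
# experimental average f_exp while staying as close as possible (in relative entropy) to w0.
# The solution has the exponential-tilting form w_i ∝ w0_i * exp(-lam * f_i).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
f = rng.normal(4.0, 1.0, size=500)     # hypothetical inter-dye distances of prior conformers (nm)
w0 = np.full(f.size, 1.0 / f.size)     # uniform prior weights from the MD ensemble
f_exp = 4.4                            # hypothetical experimental mean distance (nm)

def tilted_weights(lam):
    e = -lam * f
    w = w0 * np.exp(e - e.max())       # numerically stabilized exponential tilting
    return w / w.sum()

lam_hat = brentq(lambda lam: np.dot(tilted_weights(lam), f) - f_exp, -50.0, 50.0)
w = tilted_weights(lam_hat)
print("lambda:", lam_hat, "reweighted mean:", np.dot(w, f))
print("effective sample size:", 1.0 / np.sum(w**2))
```

The effective sample size printed at the end is one simple diagnostic of how strongly the experimental constraint perturbs the prior populations, the balance issue raised in point (i).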
- Research Article
60
- 10.1007/s004770050024
- Dec 4, 1998
- Stochastic Hydrology and Hydraulics
The similarity between maximum entropy (MaxEnt) and minimum relative entropy (MRE) allows recent advances in probabilistic inversion to obviate some of the shortcomings in the former method. The purpose of this paper is to review and extend the theory and practice of minimum relative entropy. In this regard, we illustrate important philosophies on inversion and the similarities and differences between maximum entropy, minimum relative entropy, classical smallest model (SVD) and Bayesian solutions for inverse problems. MaxEnt is applicable when we are determining a function that can be regarded as a probability distribution. The approach can be extended to the case of the general linear problem and is interpreted as giving the model which fits all the constraints and has the greatest multiplicity or “spread-out”, i.e., the one that can be realized in the greatest number of ways. The MRE solution to the inverse problem differs from the maximum entropy viewpoint as noted above. The relative entropy formulation provides the advantage of allowing for non-positive models, a prior bias in the estimated pdf and ‘hard’ bounds if desired. We outline how MRE can be used as a measure of resolution in linear inversion and show that MRE provides us with a method to explore the limits of model space. The Bayesian methodology readily lends itself to the problem of updating prior probabilities based on uncertain field measurements, a procedure whose validity follows from the theorems of total and compound probability. In the Bayesian approach information is complete and Bayes' theorem gives a unique posterior pdf. In comparing the results of the classical, MaxEnt, MRE and Bayesian approaches we notice that the approaches produce different results. In comparing MaxEnt with MRE for Jaynes' die problem we see excellent agreement between the results. We compare the MaxEnt, smallest model and MRE approaches for the density distribution of an equivalent spherically-symmetric earth and for the contaminant plume-source problem. Theoretical comparisons between MRE and Bayesian solutions for the case of the linear model and Gaussian priors may show different results. The Bayesian expected-value solution approaches that of MRE and that of the smallest model as the prior distribution becomes uniform, but the Bayesian maximum a posteriori (MAP) solution may not exist for an underdetermined case with a uniform prior.
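The die problem mentioned in the comparison is simple to reproduce. The sketch below is my own illustration: Jaynes' die problem, E[X] = 4.5 on the faces 1 to 6, solved once as MaxEnt (uniform reference) and once as MRE relative to an assumed biased prior; both are exponential tiltings of their reference measure, and they coincide when the prior is uniform.

```python
# Minimal sketch (my own illustration): Jaynes' die problem, E[X] = 4.5 on {1,...,6}, solved with
# MaxEnt (uniform reference measure) and with minimum relative entropy (MRE) relative to a biased
# prior q. Both solutions are exponential tiltings p_i ∝ q_i * exp(lam * i) of their reference.
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def tilted(q, target_mean):
    def mean(lam):
        p = q * np.exp(lam * faces)
        p /= p.sum()
        return np.dot(p, faces)
    lam = brentq(lambda l: mean(l) - target_mean, -5.0, 5.0)
    p = q * np.exp(lam * faces)
    return p / p.sum()

p_maxent = tilted(np.full(6, 1 / 6), 4.5)                          # MaxEnt: uniform reference
p_mre = tilted(np.array([0.3, 0.2, 0.15, 0.15, 0.1, 0.1]), 4.5)    # MRE: assumed biased prior
print("MaxEnt:", np.round(p_maxent, 4))
print("MRE   :", np.round(p_mre, 4))
```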
- Research Article
- 10.1111/j.2517-6161.1992.tb01865.x
- Sep 1, 1992
- Journal of the Royal Statistical Society Series B: Statistical Methodology
Discussion of the Paper by Donoho, Johnstone, Hoch and Stern
- Research Article
4
- 10.1088/0264-9381/5/11/008
- Nov 1, 1988
- Classical and Quantum Gravity
The thermodynamic equilibrium configurations of relativistic rotating stars are studied using the maximum entropy principle. It is shown that the heuristic arguments of Thorne (1971) and Zeldovich (1969) for the equilibrium conditions can be developed into a maximum entropy principle in which the variations are carried out in a fixed background spacetime. This maximum principle with the fixed background assumption is technically simpler than, but has to be justified by, a maximum entropy principle without the assumption. Such a maximum entropy principle is also formulated in this paper, showing that the general relativistic system can be treated on the same footing as other long-range force systems.
- Research Article
16
- 10.1063/1.528841
- Oct 1, 1990
- Journal of Mathematical Physics
This paper proposes an approach via the maximum entropy principle in order to determine the nonstationary solutions of the Fokker–Planck equation with time varying coefficients. The constraints are not the state moments (as usual) but their dynamic equations. The maximum entropy principle herein utilized is a slight extension of Jaynes’ principle, which involves the ‘‘path entropy’’ of the stochastic process.
- Book Chapter
10
- 10.1007/978-94-017-2217-9_31
- Jan 1, 1993
In this paper we give a short review of the Maximum Entropy (ME) principle used to solve inverse problems. We distinguish three fundamentally different approaches for solving inverse problems when using the ME principle: a) classical ME, in which the unknown function is considered to be, or to have the properties of, a probability density function; b) ME in mean, in which the unknown function is assumed to be a random function and the data are assumed to be the expected values of some finite number of known constraints on the unknown function; and finally, c) the Bayesian approach with ME priors. In this last case the ME principle is used only for assigning a probability distribution to the unknown function to translate our prior knowledge about it. For each approach, we describe the main ideas and state explicitly the hypotheses and the practical and theoretical limitations.
- Conference Article
9
- 10.1109/ccdc49329.2020.9164431
- Aug 1, 2020
Entropy is a measure of the degree of chaos in a system and originates in physics. Scientists later proposed information entropy from a mathematical perspective, and the relationship between thermodynamic entropy and information entropy was subsequently discovered. This broke down barriers between disciplines and led to many related concepts and principles. Among them, the maximum entropy principle is widely used in fields such as finance and computing, and many applications and technologies based on it have emerged. This paper introduces entropy and the maximum entropy principle and reviews the application and development of the maximum entropy principle in clustering, decision analysis, and spectrum analysis. Drawing on our literature survey, this paper also presents a new application of the maximum entropy principle, called the elastic net of clustering based on maximum entropy (ENCM), which applies the maximum entropy principle to the elastic net by modifying the objective function used for clustering. Experiments verify that this method can effectively improve clustering results.
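A maximum-entropy clustering step can be sketched generically (this is not the ENCM algorithm of the paper, and the data are hypothetical): the assignment probabilities that maximize entropy at a fixed expected distortion take a Gibbs form, and cluster centers are updated as responsibility-weighted means.

```python
# Generic maximum-entropy clustering sketch (not the paper's ENCM algorithm): the assignment
# probabilities that maximize entropy subject to a fixed expected distortion take the Gibbs form
# p(k|x) ∝ exp(-||x - c_k||^2 / T); centers are then updated as responsibility-weighted means.
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])  # hypothetical 2-D data
centers = X[rng.choice(len(X), size=2, replace=False)]
T = 0.5                                              # "temperature" controlling assignment softness

for _ in range(50):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances (n, k)
    logits = -d2 / T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # maximum-entropy soft assignments
    centers = (p.T @ X) / p.sum(axis=0)[:, None]     # weighted-mean center update

print("centers:\n", np.round(centers, 3))
```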
- Research Article
2
- 10.12928/telkomnika.v15i1.4255
- Mar 1, 2017
- TELKOMNIKA (Telecommunication Computing Electronics and Control)
In this paper, based on the definition of two-parameter joint entropy and the maximum entropy principle, a method is proposed for determining the prior distribution in the reliability evaluation of low-voltage switchgear. The maximum entropy method treats the various kinds of prior information as constraints. The optimal prior distribution is selected by maximizing entropy under these constraints, so that it incorporates the known prior information while avoiding the introduction of additional assumptions. Using the non-parametric bootstrap method, the hyperparameters of the prior distribution are obtained from the second-order moments of the prior information. Finally, with the bootstrap method, the robustness of the prior and posterior distributions is analyzed, and the posterior mean time between failures of the low-voltage switchgear is estimated.
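The bootstrap step can be illustrated in isolation. The sketch below is my own and covers only the moment part, with an assumed Gamma prior moment-matched to bootstrapped first and second moments of hypothetical historical data; it is not the paper's entropy construction.

```python
# Minimal sketch of the bootstrap step (my own illustration, not the paper's procedure): estimate
# the first two moments of historical mean-time-between-failure data by non-parametric bootstrap,
# then moment-match the hyperparameters of an assumed Gamma prior for the MTBF.
import numpy as np

rng = np.random.default_rng(5)
mtbf_hist = np.array([820., 910., 760., 1020., 880., 950., 790., 870.])  # hypothetical prior data (h)

boot_means = np.array([rng.choice(mtbf_hist, size=mtbf_hist.size, replace=True).mean()
                       for _ in range(2000)])
mu, var = boot_means.mean(), boot_means.var()        # bootstrapped first two moments

# Moment-matched Gamma(shape, scale) hyperparameters: mean = shape*scale, variance = shape*scale^2
scale = var / mu
shape = mu / scale
print(f"Gamma prior hyperparameters: shape = {shape:.1f}, scale = {scale:.2f} h")
```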
- Research Article
200
- 10.2307/2290129
- Dec 1, 1988
- Journal of the American Statistical Association
This article is concerned with the selection of subsets of predictor variables in a linear regression model for the prediction of a dependent variable. It is based on a Bayesian approach, intended to be as objective as possible. A probability distribution is first assigned to the dependent variable through the specification of a family of prior distributions for the unknown parameters in the regression model. The method is not fully Bayesian, however, because the ultimate choice of prior distribution from this family is affected by the data. It is assumed that the predictors represent distinct observables; the corresponding regression coefficients are assigned independent prior distributions. For each regression coefficient subject to deletion from the model, the prior distribution is a mixture of a point mass at 0 and a diffuse uniform distribution elsewhere, that is, a “spike and slab” distribution. The random error component is assigned a normal distribution with mean 0 and standard deviation σ, where ln(σ) has a locally uniform noninformative prior distribution. The appropriate posterior probabilities are derived for each submodel. If the regression coefficients have identical priors, the posterior distribution depends only on the data and the parameter γ, which is the height of the spike divided by the height of the slab for the common prior distribution. This parameter is not assigned a probability distribution; instead, it is considered a parameter that indexes the members of a class of Bayesian methods. Graphical methods are proposed as informal guides for choosing γ, assessing the complexity of the response function and the strength of the individual predictor variables, and assessing the degree of uncertainty about the best submodel. The following plots against γ are suggested: (a) posterior probability that a particular regression coefficient is 0; (b) posterior expected number of terms in the model; (c) posterior entropy of the submodel distribution; (d) posterior predictive error; and (e) posterior probability of goodness of fit. Plots (d) and (e) are suggested as ways to choose γ. The predictive error is determined using a Bayesian cross-validation approach that generates a predictive density for each observation, given all of the data except that observation, that is, a type of “leave one out” approach. The goodness-of-fit measure is the sum of the posterior probabilities of all submodels that pass a standard F test for goodness of fit relative to the full model, at a specified level of significance. The dependence of the results on the scaling of the variables is discussed, and some ways to choose the scaling constants are suggested. Examples based on a large data set arising from an energy-conservation study are given to demonstrate the application of the methods.
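A simplified sketch conveys how the inclusion probability responds to the prior. It is my own illustration and swaps the paper's uniform slab for a normal slab (for a closed-form marginal likelihood), uses one candidate predictor with known error variance, and traces the posterior probability of a zero coefficient against the prior spike probability rather than against the height ratio γ.

```python
# Simplified spike-and-slab sketch (my own illustration; the paper uses a uniform slab, a normal
# slab is used here for tractability): posterior probability that a single regression coefficient
# is exactly zero, traced against the prior spike probability, on hypothetical data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
n, sigma, tau = 30, 1.0, 2.0
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(scale=sigma, size=n)        # hypothetical data with a weak true effect

# Marginal likelihoods of y under the spike (beta = 0) and under the normal slab beta ~ N(0, tau^2)
f_spike = multivariate_normal(mean=np.zeros(n), cov=sigma**2 * np.eye(n)).pdf(y)
f_slab = multivariate_normal(mean=np.zeros(n),
                             cov=sigma**2 * np.eye(n) + tau**2 * np.outer(x, x)).pdf(y)

for p_spike in (0.1, 0.5, 0.9):                      # prior mass on the spike
    post_zero = p_spike * f_spike / (p_spike * f_spike + (1 - p_spike) * f_slab)
    print(f"prior P(beta=0) = {p_spike:.1f} -> posterior P(beta=0 | y) = {post_zero:.3f}")
```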
- Research Article
- 10.35508/ajes.v4i2.3533
- Dec 16, 2020
- Academic Journal of Educational Sciences
Parameter estimation is needed to characterize a situation or observational data before making a decision. Several estimation methods have been developed: the method of moments, which is the oldest; the maximum likelihood method (MLE); and the Bayes method, the most recent approach to determining the estimator of a parameter. Forecasting is likewise an important aid to decision-making. Time series analysis is one of the most frequently used forecasting techniques; here ARMA models are specifically considered. In the Bayesian approach, the parameters of the ARMA model are treated as quantities whose uncertainty is represented by a probability distribution, called the prior distribution. Within the framework of Bayes decision theory, estimator selection can be regarded as a decision problem under uncertainty. Bayesian estimators of the ARMA model parameters are derived under a multivariate normal-Wishart prior distribution and under a multivariate normal-Gamma prior distribution, and a one-step-ahead forecast is obtained.
- Research Article
2
- 10.1016/j.geoderma.2021.115396
- Sep 9, 2021
- Geoderma
A crucial decision in designing a spatial sample for soil survey is the number of sampling locations required to answer, with sufficient accuracy and precision, the questions posed by decision makers at different levels of geographic aggregation. In the Indian Soil Health Card (SHC) scheme, many thousands of locations are sampled per district. In this paper the SHC data are used to estimate the mean of a soil property within a defined study area, e.g., a district, or the areal fraction of the study area where some condition is satisfied, e.g., exceedance of a critical level. The central question is whether this large sample size is needed for this aim. The sample size required for a given maximum length of a confidence interval can be computed with formulas from classical sampling theory, using a prior estimate of the variance of the property of interest within the study area. Similarly, for the areal fraction a prior estimate of this fraction is required. In practice we are uncertain about these prior estimates, and our uncertainty is not accounted for in classical sample size determination (SSD). This deficiency can be overcome with a Bayesian approach, in which the prior estimate of the variance or areal fraction is replaced by a prior distribution. Once new data from the sample are available, this prior distribution is updated to a posterior distribution using Bayes’ rule. The apparent problem with a Bayesian approach prior to a sampling campaign is that the data are not yet available. This dilemma can be solved by computing, for a given sample size, the predictive distribution of the data, given a prior distribution on the population and design parameter. Thus we do not have a single vector with data values, but a finite or infinite set of possible data vectors. As a consequence, we have as many posterior distribution functions as we have data vectors. This leads to a probability distribution of lengths or coverages of Bayesian credible intervals, from which various criteria for SSD can be derived. Besides the fully Bayesian approach, a mixed Bayesian-likelihood approach for SSD is available. This is of interest when, after the data have been collected, we prefer to estimate the mean from these data only, using the frequentist approach, ignoring the prior distribution. The fully Bayesian and mixed Bayesian-likelihood approaches are illustrated for estimating the mean of log-transformed Zn and the areal fraction with Zn-deficiency, defined as Zn concentration <0.9 mg kg⁻¹, in the thirteen districts of Andhra Pradesh state. The SHC data from 2015–2017 are used to derive prior distributions. For all districts the Bayesian and mixed Bayesian-likelihood sample sizes are much smaller than the current sample sizes. The hyperparameters of the prior distributions have a strong effect on the sample sizes. We discuss methods to deal with this. Even at the mandal (sub-district) level the sample size can almost always be reduced substantially. Clearly SHC over-sampled, and here we show how to reduce the effort while still providing information required for decision-making. R scripts for SSD are provided as supplementary material.
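The core idea can be condensed into a small sketch (my own illustration with an assumed prior, not the authors' R scripts): classical SSD takes n = (zσ/d)² for a confidence interval of half-width d; replacing the single prior estimate of σ² with draws from a prior distribution yields a distribution of required sample sizes, from which a design criterion such as a prior-probability-of-success quantile can be read.

```python
# Minimal sketch of the idea described above (assumed prior, not the paper's R scripts): classical
# sample-size determination uses n = (z * sigma / d)^2 for a confidence interval of half-width d;
# drawing sigma^2 from a prior distribution gives a distribution of required sample sizes.
import numpy as np
from scipy import stats

d = 0.10                                   # required half-width of the 95% interval (log-Zn units)
z = stats.norm.ppf(0.975)

# Assumed prior on the within-district variance of log-Zn (hypothetical hyperparameters)
prior_var = stats.invgamma(a=8.0, scale=2.0).rvs(size=10_000)

n_required = np.ceil((z**2 * prior_var) / d**2)
print("n at the prior median variance:", int(np.ceil(z**2 * np.median(prior_var) / d**2)))
print("n meeting the criterion with 90% prior probability:", int(np.quantile(n_required, 0.90)))
```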