Abstract

When constructing discrete (binned) distributions from samples of a data set, it is sometimes desirable to ensure that all bins of the sample distribution have nonzero probability. This is the case, for example, if the sample distribution is part of a predictive model that must return a response for the entire codomain, or if Kullback–Leibler divergence is used to measure the (dis-)agreement between the sample distribution and the original distribution of the variable: the divergence becomes infinite if the sample distribution assigns zero probability to a bin that is occupied in the original distribution. Several sample-based distribution estimators exist that ensure nonzero bin probability, such as adding one counter to each zero-probability bin of the sample histogram, adding a small probability to the sample pdf, smoothing methods such as kernel density smoothing, or Bayesian approaches based on the Dirichlet and multinomial distributions. Here, we suggest and test an approach based on the Clopper–Pearson method, which makes use of the binomial distribution. Based on the sample distribution, confidence intervals for the bin-occupation probabilities are calculated. The mean of each confidence interval is a strictly positive estimator of the true bin-occupation probability and converges with increasing sample size. For small samples, the estimate approaches a uniform distribution, i.e., the method effectively applies a maximum entropy approach. We apply this nonzero method and four alternative sample-based distribution estimators to a range of typical distributions (uniform, Dirac, normal, multimodal, and irregular) and measure the effect with Kullback–Leibler divergence. While the performance of each method strongly depends on the distribution type it is applied to, on average, and especially for small sample sizes, the nonzero method, the simple “add one counter” method, and the Bayesian Dirichlet-multinomial model show very similar behavior and perform best. We conclude that, when estimating distributions without an a priori idea of their shape, applying one of these methods is favorable.
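
The interval-based estimator described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the 95% confidence level, the use of the interval midpoint as its "mean", the final renormalization to a sum of one, and all function names and example numbers are assumptions made here. The Clopper–Pearson bounds are found by bisection on the binomial tail probabilities that define them.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_decreasing(f, iters=60):
    """Root in (0, 1) of a function that decreases from positive to negative."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided Clopper-Pearson interval for k hits in n samples."""
    half = alpha / 2
    # Lower bound solves P(X >= k | p) = alpha/2; it is 0 for an empty bin.
    lower = 0.0 if k == 0 else bisect_decreasing(
        lambda p: half - (1.0 - binom_cdf(k - 1, n, p)))
    # Upper bound solves P(X <= k | p) = alpha/2; it is 1 if k == n.
    upper = 1.0 if k == n else bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - half)
    return lower, upper


def nonzero_estimate(counts, alpha=0.05):
    """Interval midpoint per bin, renormalized to sum to 1 (assumed step)."""
    n = sum(counts)
    mids = [sum(clopper_pearson(k, n, alpha)) / 2 for k in counts]
    total = sum(mids)
    return [m / total for m in mids]


def kl_divergence(p, q):
    """D(p || q) in bits; infinite if q is zero in a bin where p is not."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            d += pi * math.log2(pi / qi)
    return d


counts = [0, 3, 7]            # hypothetical sample histogram
reference = [0.1, 0.3, 0.6]   # hypothetical full distribution
naive = [c / sum(counts) for c in counts]
print(kl_divergence(reference, naive))                     # inf
print(kl_divergence(reference, nonzero_estimate(counts)))  # a finite value
```

With no samples at all (every count zero), each interval in this sketch is [0, 1], so the renormalized estimate is uniform, which matches the maximum entropy behavior the abstract describes for small samples.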

Highlights

  • Suppose a scientist, having gathered extensive data at one site, wants to know whether the same effort is required at each new site, or whether a smaller data set would already have provided essentially the same information

  • (Entropy 2018, 20, 601) [...] of vegetation classes at a site, or the distribution of forecasted rainfall), the representativeness of a subset of the data can be evaluated by measuring the agreement of a distribution based on a randomly drawn sample (“sample distribution”) and the distribution based on the full data set

  • As the standard BC approach does not guarantee this, we proposed an alternative approach based on the Clopper–Pearson method, which makes use of the binomial distribution


Summary

Introduction

Suppose a scientist, having gathered extensive data at one site, wants to know whether the same effort is required at each new site, or whether a smaller data set would already have provided essentially the same information. Working with ensemble forecasts usually involves handling considerable amounts of data, and the forecaster might want to know whether a subset of the ensemble is sufficient to capture its essential characteristics. [...] of vegetation classes at a site, or the distribution of forecasted rainfall), the representativeness of a subset of the data can be evaluated by measuring the (dis-)agreement of a distribution based on a randomly drawn sample (“sample distribution”) and the distribution based on the full data set (“full distribution”). Depending on the particular interest of the user, potential advantages of this measure are that it is nonparametric, which avoids parameter choices influencing the result, and that it measures general agreement of the distributions instead of focusing on particular aspects, e.g., particular moments.
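
The comparison of a sample distribution with the full distribution can be illustrated with a small sketch. Everything specific in it is an assumption for illustration: the vegetation class names and counts are invented, the sample size of 20 is arbitrary, Kullback–Leibler divergence is taken in the direction D(full || sample), and the helper names are hypothetical. The estimator shown is the simple "add one counter to each empty bin" variant, used here only because it is short.

```python
import math
import random
from collections import Counter


def relative_frequencies(labels, classes):
    """Plain histogram estimate; bins may have zero probability."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def add_one_to_empty_bins(labels, classes):
    """Put one counter into every empty bin, then normalize."""
    counts = Counter(labels)
    filled = [max(counts[c], 1) for c in classes]
    total = sum(filled)
    return [f / total for f in filled]


def kl_divergence(p, q):
    """D(p || q) in bits; infinite if q is zero in a bin where p is not."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            d += pi * math.log2(pi / qi)
    return d


# Hypothetical full data set: vegetation class observed at 1000 plots.
classes = ["forest", "grass", "shrub", "bare"]
full_data = ["forest"] * 600 + ["grass"] * 300 + ["shrub"] * 90 + ["bare"] * 10

rng = random.Random(0)                 # fixed seed for repeatability
sample = rng.sample(full_data, 20)     # randomly drawn subset

full = relative_frequencies(full_data, classes)
print(kl_divergence(full, relative_frequencies(sample, classes)))
print(kl_divergence(full, add_one_to_empty_bins(sample, classes)))
```

If the drawn subset happens to miss a rare class, the first divergence is infinite while the second stays finite, which is the practical reason for requiring nonzero bin probabilities.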

