Abstract

Flexible and reliable probability density estimation is fundamental in unsupervised learning and classification. Finite Gaussian mixture models are commonly used for this purpose. However, the parametric form of the distribution is not always known, and in that case non-parametric density estimation methods are used. These methods usually become computationally demanding as the number of components increases. In this paper, the accuracy of several non-parametric density estimators is compared by means of simulation. The following approaches are considered: an adaptive bandwidth kernel estimator, a projection pursuit estimator, a logspline estimator, and a k-nearest neighbor estimator. It is concluded that data clustering as a pre-processing step improves the estimation of mixture densities. However, when the data do not have clearly defined clusters, the pre-processing step gives little advantage. The application of the density estimators is illustrated using municipal solid waste data collected in Kaunas (Lithuania). The distribution of these data resembles the kurtotic unimodal benchmark density introduced by Marron and Wand, that is, it has one mode and pronounced kurtosis. Based on homogeneity tests, it can be concluded that the distributions of the municipal solid waste fractions in Kutaisi (Georgia), Saint-Petersburg (Russia), and Boryspil (Ukraine) do not differ statistically significantly from the distribution of waste fractions in Kaunas.

Highlights

  • The problem under consideration is closely related to distribution analysis

  • The accuracy of the density estimation algorithms has been compared using four different types of statistics, which are also used by other researchers

  • This study has shown that the densities of Kutaisi, Saint-Petersburg, and Boryspil municipal solid waste fractions are similar to the densities of Kaunas municipal solid waste fractions


Summary

Introduction

The problem under consideration is closely related to distribution analysis, which is an important branch of data analysis and is used to solve various other problems (discriminant analysis, image recognition, etc.). When the data distribution is multimodal and the sample size is small, it is not easy in practice to choose a robust density estimation method. The following new research areas are investigated: 1) new density estimation methods based on inverse formula are formulated; 2) in order to compare the robustness of the estimators, the wide set of distributions proposed by Marron and Wand [26] is used to study densities; 3) in order to obtain results with a reasonably high level of confidence, a relatively large number (100000) of independent samples has been generated. A pilot comparative study of the accuracy of several non-parametric estimators [32] showed that the Friedman procedure is more robust in the majority of the examined multivariate Gaussian mixture cases where the components can be separated.
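
To make the simulation setup concrete, the following is a minimal sketch of one replication of such an accuracy experiment. It draws a sample from the Marron and Wand kurtotic unimodal density (2/3 N(0, 1) + 1/3 N(0, 0.1^2)), estimates the density, and measures the error against the known truth. Several choices here are assumptions made only for illustration and are not taken from the paper: the estimator is a plain fixed-bandwidth Gaussian kernel estimator with Silverman's rule-of-thumb bandwidth (not one of the four estimators compared in the study), the error measure is the integrated squared error on a grid, and the sample size, grid, and all function names are hypothetical.

```python
import numpy as np


def normal_pdf(x, sigma):
    """Density of N(0, sigma^2) evaluated at x."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def kurtotic_unimodal_pdf(x):
    """Marron-Wand kurtotic unimodal density: 2/3 N(0, 1) + 1/3 N(0, 0.1^2)."""
    return (2.0 / 3.0) * normal_pdf(x, 1.0) + (1.0 / 3.0) * normal_pdf(x, 0.1)


def sample_kurtotic_unimodal(n, rng):
    """Draw n points: each point uses the narrow component with probability 1/3."""
    narrow = rng.random(n) < 1.0 / 3.0
    sigma = np.where(narrow, 0.1, 1.0)
    return rng.normal(0.0, 1.0, size=n) * sigma


def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth (assumed baseline, not the paper's choice)."""
    q75, q25 = np.percentile(sample, [75, 25])
    scale = min(sample.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * scale * len(sample) ** (-0.2)


def gaussian_kde(sample, grid, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated on grid."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(sample) * bandwidth * np.sqrt(2.0 * np.pi))


def integrated_squared_error(estimate, truth, grid):
    """Riemann-sum approximation of the integral of (estimate - truth)^2."""
    dx = grid[1] - grid[0]
    return float(np.sum((estimate - truth) ** 2) * dx)


rng = np.random.default_rng(0)           # fixed seed for repeatability
grid = np.linspace(-4.0, 4.0, 801)       # evaluation grid, step 0.01
truth = kurtotic_unimodal_pdf(grid)

sample = sample_kurtotic_unimodal(200, rng)   # illustrative sample size
h = silverman_bandwidth(sample)
estimate = gaussian_kde(sample, grid, h)
ise = integrated_squared_error(estimate, truth, grid)

print(f"bandwidth = {h:.4f}, ISE = {ise:.4f}")
```

Repeating the last block over many independent samples and averaging the error gives a Monte Carlo accuracy estimate for one estimator on one benchmark density; a comparison of estimators would run the same loop with each estimator substituted for `gaussian_kde`.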

Sample clustering with the EM algorithm
The density estimation algorithms analysed
The analysis of estimation accuracy
Conclusions and future work