Abstract
Background: Bayesian clustering algorithms, in particular those utilizing Dirichlet Processes (DP), return a sample of the posterior distribution of partitions of a set. However, in many applied cases a single clustering solution is desired, requiring a 'best' partition to be created from the posterior sample. It is an open research question which solution should be recommended in which situation. One candidate is the sample mean, defined as the clustering with minimal squared distance to all partitions in the posterior sample, weighted by their probability. In this article, we review an algorithm that approximates this sample mean by using the Hungarian Method to compute the distance between partitions. This algorithm leaves room for further acceleration.

Results: We highlight a faster variant of the partition distance reduction that leads to a runtime complexity up to two orders of magnitude lower than that of the standard variant. We suggest two further improvements. The first is deterministic and based on an adapted dynamic version of the Hungarian Algorithm, which achieves another runtime decrease of at least one order of magnitude. The second is theoretical and uses Monte Carlo techniques and the dynamic matrix inverse, thereby reducing the runtime complexity further by nearly the square root of one order of magnitude.

Conclusions: Overall, this results in a new mean partition algorithm whose acceleration factor exceeds that of the present algorithm by the size of the partitions. The new algorithm is implemented in Java and available on GitHub (Glassen, Mean Partition, 2018).
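The abstract reduces the distance between two partitions to an assignment problem: the distance equals n minus the maximum total overlap achievable by pairing clusters of one partition with clusters of the other. The following sketch (not the paper's implementation; class and method names are chosen for illustration) demonstrates that reduction, brute-forcing the assignment over injective cluster pairings instead of using the Hungarian Method, which is adequate only for small cluster counts:

```java
import java.util.Arrays;

// Illustrative sketch: the partition distance d(P, Q) equals n minus the
// maximum total overlap over all injective assignments of P-clusters to
// Q-clusters, where the weight of a pair is |P_i ∩ Q_j|. The paper solves
// this assignment problem with the Hungarian Method; here it is brute-forced.
public class PartitionDistance {

    // labelsP[i] / labelsQ[i] give the cluster index of element i.
    public static int distance(int[] labelsP, int[] labelsQ) {
        int n = labelsP.length;
        int kP = Arrays.stream(labelsP).max().orElse(-1) + 1;
        int kQ = Arrays.stream(labelsQ).max().orElse(-1) + 1;

        // Contingency matrix: overlap[i][j] = |P_i ∩ Q_j|.
        int[][] overlap = new int[kP][kQ];
        for (int e = 0; e < n; e++) {
            overlap[labelsP[e]][labelsQ[e]]++;
        }
        return n - maxAssignment(overlap, 0, new boolean[kQ]);
    }

    // Brute-force recursion: assign each row to an unused column, or skip it.
    private static int maxAssignment(int[][] w, int row, boolean[] used) {
        if (row == w.length) return 0;
        int best = maxAssignment(w, row + 1, used); // leave this row unmatched
        for (int col = 0; col < used.length; col++) {
            if (!used[col]) {
                used[col] = true;
                best = Math.max(best, w[row][col] + maxAssignment(w, row + 1, used));
                used[col] = false;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] p = {0, 0, 1, 1, 2};   // partition {0,1},{2,3},{4}
        int[] q = {0, 0, 0, 1, 1};   // partition {0,1,2},{3,4}
        System.out.println(distance(p, q)); // prints 2
    }
}
```

The brute-force step is exponential in the number of clusters; the Hungarian Method solves the same assignment problem in cubic time, which is why the paper's runtime improvements focus on that step.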
Highlights
Bayesian clustering algorithms, in particular those utilizing Dirichlet Processes (DP), return a sample of the posterior distribution of partitions of a set
The tool uses a DP mixture model (DPMM) and an approximation method to determine the mean of the generated samples
One could use Variational Bayes instead of Markov chain Monte Carlo (MCMC) sampling, or replace the mean partition approximation by an alternative consensus clustering algorithm (e.g., CC-Pivot [6]). Both strategies are comparatively faster, but in many cases the accuracy of the calculated means can be severely impaired. This applies to both MCMC versus Variational Bayes [7] and mean partition approximation versus other consensus clustering approaches [8, 9]
Summary
Bayesian clustering algorithms, in particular those utilizing Dirichlet Processes (DP), return a sample of the posterior distribution of partitions of a set. One could use Variational Bayes instead of Markov chain Monte Carlo (MCMC) sampling, or replace the mean partition approximation by an alternative consensus clustering algorithm (e.g., CC-Pivot [6]). Both strategies are comparatively faster, but in many cases the accuracy of the calculated means can be severely impaired. This applies to both MCMC versus Variational Bayes [7] and mean partition approximation versus other consensus clustering approaches [8, 9]. Our resulting algorithm offers the same accuracy as the original method at a significantly lower runtime complexity
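The mean partition objective described above (minimal squared distance to all sampled partitions, weighted by their probability) can be illustrated with a simple medoid approximation that restricts candidates to the sample itself. This is a sketch of the objective only, not the paper's algorithm, and the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: approximate the mean partition of a weighted posterior
// sample by its medoid, i.e. the sampled partition with minimal weighted sum
// of squared partition distances to all samples. The paper's algorithm is
// more general; this restriction merely demonstrates the objective function.
public class MeanPartition {

    public static int[] mean(List<int[]> sample, double[] weights) {
        int[] best = sample.get(0);
        double bestScore = Double.POSITIVE_INFINITY;
        for (int[] candidate : sample) {
            double s = 0.0;
            for (int i = 0; i < sample.size(); i++) {
                int d = distance(candidate, sample.get(i));
                s += weights[i] * d * d;   // squared distance, probability-weighted
            }
            if (s < bestScore) { bestScore = s; best = candidate; }
        }
        return best;
    }

    // Partition distance: n minus the maximum total cluster overlap,
    // brute-forced here instead of using the Hungarian Method.
    static int distance(int[] a, int[] b) {
        int n = a.length;
        int ka = Arrays.stream(a).max().orElse(-1) + 1;
        int kb = Arrays.stream(b).max().orElse(-1) + 1;
        int[][] overlap = new int[ka][kb];
        for (int e = 0; e < n; e++) overlap[a[e]][b[e]]++;
        return n - maxAssignment(overlap, 0, new boolean[kb]);
    }

    static int maxAssignment(int[][] w, int row, boolean[] used) {
        if (row == w.length) return 0;
        int best = maxAssignment(w, row + 1, used); // leave this row unmatched
        for (int col = 0; col < used.length; col++) {
            if (!used[col]) {
                used[col] = true;
                best = Math.max(best, w[row][col] + maxAssignment(w, row + 1, used));
                used[col] = false;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<int[]> sample = List.of(
            new int[]{0, 0, 1}, new int[]{0, 0, 1}, new int[]{0, 1, 1});
        double[] weights = {0.4, 0.4, 0.2};
        System.out.println(Arrays.toString(mean(sample, weights))); // prints [0, 0, 1]
    }
}
```

Each candidate evaluation requires one distance computation per sample, which is why accelerating the partition distance step, as the paper does, dominates the overall runtime.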