High-dimensional count data clustering based on an exponential approximation to the multinomial Beta-Liouville distribution

Nuha Zamzami, Nizar Bouguila

doi:10.1016/j.ins.2020.03.028

Abstract

In this paper, we propose a mixture model for clustering high-dimensional count data, based on an exponential-family approximation to the Multinomial Beta-Liouville distribution, which we call EMBL. We simultaneously address the problems of fitting the model to observed data and selecting the number of components. The learning algorithm automatically selects the optimal number of components and avoids several drawbacks of the standard Expectation-Maximization algorithm, including sensitivity to initialization and possible convergence to the boundary of the parameter space. We demonstrate the effectiveness and robustness of the proposed clustering approach through extensive empirical experiments involving challenging real-world applications. The results show that the proposed model achieves higher accuracy than state-of-the-art generative models for count data clustering. Furthermore, the superior performance of EMBL demonstrates its flexibility, its ability to address the burstiness phenomenon, and its computational efficiency, especially when dealing with sparse high-dimensional vectors.

Full Text