Numerous clustering algorithms prioritize accuracy, but in high-risk domains the interpretability of clustering methods is crucial as well. The inherent heterogeneity of categorical data makes it particularly challenging for users to comprehend clustering outcomes. Most existing interpretable clustering methods are tailored to numerical data and rely on decision tree models, leaving interpretable clustering of categorical data a largely unexplored area. Moreover, existing interpretable clustering algorithms often depend on external, potentially non-interpretable algorithms and lack transparency in the decision-making process during tree construction. In this paper, we tackle interpretable categorical data clustering by growing a decision tree in a statistically meaningful manner. We formulate the evaluation of each candidate split as a multivariate two-sample testing problem, deriving a single p-value by combining the significance evidence from all individual categories. This provides a reliable and controllable method for selecting the optimal split while quantifying its statistical significance. Extensive experiments on real-world data sets demonstrate that our algorithm achieves performance comparable to its counterparts in cluster quality, running efficiency, and explainability.
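The split-evaluation idea can be sketched in code. The following is only an illustrative reconstruction, not the paper's actual algorithm: it scores one candidate split of categorical data by running a 2x2 chi-square two-sample test per category (category membership vs. split side) and combining the per-category p-values with Fisher's method; the paper's specific tests and combination rule may differ, and Fisher's method assumes the p-values are approximately independent, which categories of a single attribute are not in general.

```python
import math

def chi2_sf_1df(x: float) -> float:
    """Survival function of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_sf_even_df(x: float, df: int) -> float:
    """Exact chi-square survival function for an even number of df:
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def category_pvalue(left, right, cat) -> float:
    """Two-sample p-value for one category via a 2x2 chi-square test
    of 'is this category' vs. 'which side of the split'."""
    a = sum(v == cat for v in left)    # occurrences of cat on the left
    b = len(left) - a
    c = sum(v == cat for v in right)   # occurrences of cat on the right
    d = len(right) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate margin: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2_sf_1df(chi2)

def split_pvalue(left, right) -> float:
    """Single p-value for a candidate split: combine per-category
    p-values with Fisher's method, -2 * sum(log p) ~ chi2(2m)."""
    cats = sorted(set(left) | set(right))
    pvals = [category_pvalue(left, right, c) for c in cats]
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return chi2_sf_even_df(stat, 2 * len(pvals))
```

A split that shifts the category distribution between its two sides yields a small combined p-value, while a split that leaves the distribution unchanged yields a large one, so the tree grower can both rank candidate splits and stop when no split is statistically significant:

```python
print(split_pvalue(['a'] * 8 + ['b'] * 2, ['b'] * 8 + ['a'] * 2))  # small
print(split_pvalue(['a', 'b'] * 2, ['a', 'b'] * 2))                # 1.0
```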