Towards U-statistics clustering inference for multiple groups

Debora Zava Bello, Marcio Valk, Gabriela Bettella Cybis

doi:10.1080/00949655.2023.2239978

Abstract

Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods that assess statistical significance have recently drawn attention due to their role in identifying patterns in high-dimensional data, with applications in many scientific fields. Towards developing a general framework for clustering in multiple groups, we present here a U-statistics-based approach, specially tailored for high-dimensional datasets, that clusters the data into three groups while assessing the significance of such partitions. We also consider theoretical aspects of allowing for an outlier group. Our approach builds on the U-statistics-based clustering framework of the methods in the R package uclust and inherits its properties, being a non-parametric method that relies on very few assumptions about the data. Thus, it can be applied to a wide range of datasets. Furthermore, our method aims to be a statistically powerful tool for finding the best partitions of the data into three groups when that particular structure is present. To do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and is shown to have statistical power comparable to or greater than that of competing alternatives in all scenarios considered. An application to image recognition data showcases our method.

Full Text