Variance-based clustering methods and higher order data transformations and their applications

Casimir A Kulikowski ,Nikita I Lytkin

doi:10.7282/t3b27vfq

Abstract

Two approaches have been proposed in statistical and machine learning communities in order to address the problem of uncovering clusters with complex structure. One approach relies on the development of clustering criteria that are able to accommodate increasingly complex characteristics of the data. The other approach is based on simplification of structure of data by mapping it to a different feature space via a non-linear function and then clustering in the new space. This dissertation covers three related studies: development of a novel multi-dimensional clustering method, development of non-linear mapping functions that leverage higher-order co-occurrences between features in boolean data, and applications of these mapping functions for improving the performance of clustering methods. In particular, we treat clustering as a combinatorial optimization problem of finding a partition of the data so as to minimize a certain criterion. We develop a novel multi-dimensional clustering method based on a statistically-motivated criterion proposed by J. Neyman for stratified sampling from one-dimensional data. We show that this criterion is more reflective of the underlying data structure than the seemingly similar K-means criterion when second order variability is not homogeneous between constituent subgroups. Furthermore, experimental results demonstrate that generalization of the Neyman's criterion to multi-dimensional spaces and development of the associated clustering algorithm allow for statistically efficient estimation of the grand mean vector of a population. In the framework of the mapping-based approach to discovering complex cluster structures, we introduced a novel adaptive non-linear data transformation termed Unsupervised Second Order Transformation (USOT). The novelties behind USOT are (a) that it leverages in a unsupervised manner, higher-order co-occurrences between features in boolean data, and (b) that it considers each feature in the context of probabilistic relationships with other features. In addition, USOT has two desirable properties. USOT adaptively selects features that would influence the mapping of a given feature, and preserves the interpretability of dimensions of the transformed space. Experimental results on text corpora and financial time series demonstrate that by leveraging higher-order co-occurrences between features, clustering methods achieved statistically significant improvements in USOT space over the original boolean space.

Full Text