Abstract

Clustering large and complex data sets whose partitions may adopt arbitrary shapes remains a difficult challenge. Part of this challenge comes from the difficulty of defining a similarity measure between the data points that captures their underlying geometry. In this paper, we propose an algorithm, DCG++, that generates a similarity measure that is both data-driven and ultrametric. DCG++ uses Markov chain random walks to capture the intrinsic geometry of the data, scans possible scales, and combines all this information using a simple procedure that is shown to generate an ultrametric. We validate the effectiveness of this similarity measure within the context of clustering on synthetic data with complex geometry, on a real-world data set containing segmented audio records of frog calls described by mel-frequency cepstral coefficients, and on an image segmentation problem. The experimental results show a significant improvement in performance with the DCG-based ultrametric compared to using an empirical distance measure.
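
For readers unfamiliar with the term, the property that makes a distance an ultrametric can be stated compactly. This is the standard textbook definition; the symbols d, x, y, and z are generic notation chosen here and are not taken from the paper.

```latex
% A distance d on a set X is an ultrametric if, for all x, y, z in X:
\begin{align}
  d(x, y) &\ge 0, \qquad d(x, y) = 0 \iff x = y, \\
  d(x, y) &= d(y, x), \\
  d(x, z) &\le \max\bigl( d(x, y),\, d(y, z) \bigr).
\end{align}
```

The last line, the strong triangle inequality, is stricter than the usual triangle inequality and is what ties an ultrametric to a hierarchy of nested clusters.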

Highlights

  • Given a set of objects O, usually referred to as data points, each characterized by some measured properties, or features D, it is natural to think of comparing them and possibly grouping them into categories, such that objects that belong to the same category are deemed to be more similar to each other than to objects in other categories

  • The resulting membership matrices are combined to generate a new distance matrix on the data. We note that this procedure resembles the idea of a diffusion distance computed by the diffusion map algorithms [6], with the main difference that we explore the geometry of the data by scanning over the parameter defining the local scale of the data, namely the temperature parameter in our approach, rather than scanning the extent to which the random walks are generated, namely the time parameter in the diffusion map algorithms

  • Existing methods rely on different interpretations of the representation of the data points to be clustered, of the distance or similarity measures on those data, of the methods used to detect the manifolds on which those data lie, and even of what defines clusters
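
The contrast drawn in the highlights between scanning a temperature and scanning a time can be illustrated with a small sketch. It assumes an exponential kernel exp(-d/T) for the random-walk weights; this summary does not give the kernel used by DCG++, so the function `transition_matrix`, the toy data, and all parameter values below are hypothetical.

```python
import numpy as np

def transition_matrix(dist, temperature):
    """Row-stochastic random-walk matrix built from a distance matrix.

    Assumption: weights are exp(-dist / temperature). The actual kernel
    used by DCG++ is not stated in this summary.
    """
    weights = np.exp(-dist / temperature)
    np.fill_diagonal(weights, 0.0)  # the walker must move at every step
    return weights / weights.sum(axis=1, keepdims=True)

# Toy data: two tight pairs of points, far apart on a line.
points = np.array([0.0, 0.1, 5.0, 5.1])
dist = np.abs(points[:, None] - points[None, :])

# Scanning the temperature changes the local scale, hence the graph itself.
cold = transition_matrix(dist, temperature=0.1)    # walker stays within its pair
hot = transition_matrix(dist, temperature=100.0)   # walker jumps almost uniformly

# Scanning the time keeps a single graph and lengthens the walk instead,
# which is the parameter varied in diffusion maps.
long_walk = np.linalg.matrix_power(transition_matrix(dist, 1.0), 50)
```

At low temperature nearly all of the probability mass in each row sits on the nearest neighbor, while at high temperature the rows flatten out. That is the sense in which the temperature sets the local scale.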

Summary

Introduction

Given a set of objects O, usually referred to as data points, each characterized by some measured properties, or features D, it is natural to think of comparing them and possibly grouping them into categories, such that objects that belong to the same category are deemed to be more similar to each other than to objects in other categories. Most of the methods that implement a concept of a local metric rely on the construction of an ε-graph on the data, where ε is a parameter that defines the size of the neighborhood of a data point. This parameter is either set to a hard cutoff, such as in the original implementation of ISOMAP [4], or to the width of a Gaussian kernel, as is usually done in spectral clustering techniques [13]. Following previously published preliminary studies [14, 15], we argue in this paper that exploring the range of possible values for the scale parameter allows us to automatically capture the hierarchical geometry of the data points under study, much akin to the persistent homology used in topological data analysis [10]. Based on this idea, we proposed a method inspired by statistical physics that makes use of a temperature parameter T (equivalent to the ε parameter) to monitor phase transitions [14]. We conclude the paper with a discussion on future developments of the method itself.
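
The idea of scanning the scale parameter and combining the resulting memberships can be sketched in a few lines. This is not the DCG++ procedure, which relies on Markov chain random walks and a temperature parameter. As a simplified stand-in, the sketch uses connected components of a hard-cutoff ε-graph. The function names, the toy points, and the list of scales are all hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def scale_scan_distance(points, scales):
    """Fraction of scales at which two points lie in different connected
    components of the epsilon-graph (a stand-in for DCG++ memberships)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n = len(points)
    separated = np.zeros((n, n))
    for eps in scales:
        # Hard-cutoff epsilon-graph: an edge wherever the distance is <= eps.
        adjacency = (dist <= eps).astype(float)
        _, labels = connected_components(adjacency, directed=False)
        # Membership information at this scale: 1 if the pair is split.
        separated += labels[:, None] != labels[None, :]
    return separated / len(scales)

def is_ultrametric(d, tol=1e-12):
    """Check d[i, k] <= max(d[i, j], d[j, k]) for every triple (i, j, k)."""
    bound = np.maximum(d[:, :, None], d[None, :, :])
    return bool(np.all(d[:, None, :] <= bound + tol))

# Toy data: two tight pairs of points, far apart (hypothetical values).
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
scales = [0.05, 0.5, 2.0, 10.0]
d = scale_scan_distance(points, scales)
print(d)
print(is_ultrametric(d))
```

Because the components of an ε-graph can only merge as ε grows, the partitions are nested, and counting the scales at which a pair is split then satisfies the strong triangle inequality. The paper states that its own combination procedure yields an ultrametric as well, but the argument given here applies only to this simplified stand-in.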
