DCG++: A data-driven metric for geometric pattern recognition.

Jiahui Guan, Fushing Hsieh, Patrice Koehl

doi:10.1371/journal.pone.0217838

Open Access

https://doi.org/10.1371/journal.pone.0217838

Journal: PLOS ONE	Publication Date: Jun 6, 2019
Citations: 2	License type: CC BY 4.0

Affiliation: University of California, Davis

Abstract

Clustering large and complex data sets whose partitions may adopt arbitrary shapes remains a difficult challenge. Part of this challenge comes from the difficulty of defining a similarity measure between the data points that captures their underlying geometry. In this paper, we propose an algorithm, DCG++, that generates a similarity measure that is both data-driven and ultrametric. DCG++ uses Markov chain random walks to capture the intrinsic geometry of the data, scans the possible scales, and combines all this information using a simple procedure that is shown to generate an ultrametric. We validate the effectiveness of this similarity measure in the context of clustering on synthetic data with complex geometry, on a real-world data set containing segmented audio records of frog calls described by mel-frequency cepstral coefficients, and on an image segmentation problem. The experimental results show a significant improvement in performance with the DCG-based ultrametric compared to using an empirical distance measure.
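The abstract relies on the notion of an ultrametric, that is, a distance satisfying the strong triangle inequality d(x, z) ≤ max(d(x, y), d(y, z)). As a minimal sketch, the check below tests that inequality on a small distance matrix. The function name `is_ultrametric` and the two example matrices are illustrative assumptions and do not come from the paper.

```python
from itertools import permutations


def is_ultrametric(dist, tol=1e-12):
    """Return True if the square matrix `dist` satisfies
    d(i, k) <= max(d(i, j), d(j, k)) for all distinct i, j, k."""
    n = len(dist)
    for i, j, k in permutations(range(n), 3):
        if dist[i][k] > max(dist[i][j], dist[j][k]) + tol:
            return False
    return True


# Tree-like distances: points 0 and 1 merge at height 1, point 2 joins at height 3.
tree_like = [[0, 1, 3],
             [1, 0, 3],
             [3, 3, 0]]

# Points on a line at positions 0, 1, 2: a valid metric, but not an ultrametric.
on_a_line = [[0, 1, 2],
             [1, 0, 1],
             [2, 1, 0]]

print(is_ultrametric(tree_like), is_ultrametric(on_a_line))
```

An ultrametric corresponds to a hierarchy: in the first matrix every triangle is isosceles with its two longest sides equal, which is what allows the distances to be read off as merge heights of a tree.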

Highlights

Given a set of objects O, usually referred to as data points, each characterized by some measured properties, or features D, it is natural to think of comparing them and possibly grouping them into categories, such that objects that belong to the same category are deemed to be more similar to each other than to objects in other categories.
The resulting membership matrices are combined to generate a new distance matrix on the data. We note that this procedure bears similarity to the idea of a diffusion distance computed by the diffusion map algorithms [6]. The main difference is that we explore the geometry of the data by scanning over the parameter that defines the local scale of the data (the temperature parameter in our approach), rather than over the extent to which the random walks are generated (the time parameter in the diffusion map algorithms).
Existing methods rely on different interpretations of the representation of the data points to be clustered, of the distance or similarity measures on those data, of the methods used to detect the manifolds on which those data lie, and even of what defines a cluster.
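The idea of combining cluster memberships obtained at several scales into one distance matrix can be sketched as follows. This is a simplified illustration under stated assumptions, not the DCG++ procedure: a hard distance threshold stands in for the temperature-dependent random walks, clusters are taken to be connected components of the thresholded graph, and the names `components` and `ensemble_distance` are hypothetical.

```python
import math


def components(points, scale):
    """Label connected components of the graph that links
    every pair of points whose distance is at most `scale`."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= scale:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def ensemble_distance(points, scales):
    """Fraction of scanned scales at which two points fall in different clusters."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for scale in scales:
        labels = components(points, scale)
        for i in range(n):
            for j in range(n):
                if labels[i] != labels[j]:
                    dist[i][j] += 1.0 / len(scales)
    return dist


# Two tight pairs of points, far apart from each other (toy data).
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.4, 0.0)]
D = ensemble_distance(points, scales=[0.45, 1.0, 3.0])
```

In this toy setting the partitions are nested as the scale grows, so the combined matrix satisfies the strong triangle inequality. Points that are never separated at any scanned scale get distance zero, so strictly speaking it is a pseudo-ultrametric.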

Summary

Introduction

Given a set of objects O, usually referred to as data points, each characterized by some measured properties, or features D, it is natural to think of comparing them and possibly grouping them into categories, such that objects that belong to the same category are deemed to be more similar to each other than to objects in other categories. Most of the methods that implement a concept of a local metric rely on the construction of an ε-graph on the data, where ε is a parameter that defines the size of the neighborhood of a data point. This parameter is either set to a sharp cutoff, as in the original implementation of ISOMAP [4], or to the width of a Gaussian kernel, as is usual in spectral clustering techniques [13]. Following previously published preliminary studies [14, 15], we argue in this paper that exploring the range of possible values for the scale parameter allows us to automatically capture the hierarchical geometry of the data points under study, much like the persistent homology used in topological data analysis [10]. Based on this idea, we proposed a method inspired by statistical physics that uses a temperature parameter T (equivalent to the ε parameter) to monitor phase transitions [14]. We conclude the paper with a discussion of future developments of the method itself.
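To make the role of the scale parameter concrete, the sketch below turns pairwise distances into random-walk transition probabilities using a Gaussian kernel whose width is set by a temperature T. The kernel form exp(-d²/T), the function name `transition_matrix`, and the toy points are assumptions made for illustration; this excerpt does not give the exact kernel used by the authors.

```python
import math


def transition_matrix(points, temperature):
    """Row-normalize Gaussian affinities exp(-d^2 / T) into
    random-walk transition probabilities (no self-transitions)."""
    n = len(points)
    rows = []
    for i in range(n):
        weights = [
            0.0 if i == j
            else math.exp(-math.dist(points[i], points[j]) ** 2 / temperature)
            for j in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


# Two close points and one distant point (toy data).
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
cold = transition_matrix(points, temperature=1.0)    # small scale
hot = transition_matrix(points, temperature=100.0)   # large scale
```

At low temperature a walker starting at the first point moves almost surely to its near neighbor, whereas at high temperature the distant point becomes reachable with noticeable probability. Scanning T therefore exposes the structure of the data at different scales.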
