Clustering of measures via mean measure quantization

Frédéric Chazal, Martin Royer, Clément Levrard

doi:10.1214/21-ejs1834

Abstract

This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold. First, we intend to approximate with a compactly supported measure the mean of the measure-generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that in a mixture of measure-generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.

Highlights

This paper handles the case where we observe $n$ i.i.d. measures $X_1, \dots, X_n$ on $\mathbb{R}^d$, rather than $n$ i.i.d. sample points, the latter case being the standard input of many machine learning algorithms.
The framework of i.i.d. sample measures encompasses analysis of multi-channel time series, for instance in embankment dam anomaly detection from piezometers [21], as well as topological data analysis, where persistence diagrams naturally appear as discrete measures in $\mathbb{R}^2$ [8, 10].
The objective of the paper is the following: we want to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves cluster structure, if any.

Summary

Introduction

This paper handles the case where we observe $n$ i.i.d. measures $X_1, \dots, X_n$ on $\mathbb{R}^d$, rather than $n$ i.i.d. sample points, the latter case being the standard input of many machine learning algorithms. The objective of the paper is the following: we want to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves cluster structure, if any. Such an embedding, combined with standard clustering techniques, should result in efficient measure clustering procedures. [...] $X_n$ in practice is too costly for large $n$, even with approximating algorithms [14, 31]. To overcome these difficulties, we choose to define the central measure as the arithmetic mean of $X$, denoted by $E(X)$, that assigns the weight $E[X(A)]$ to a Borel set $A$. [...] $X_n$. Section 3.3 investigates the special case where the measures are persistence diagrams built from samplings of different shapes, showing that all the previously exposed theoretical results apply in this framework. Proofs of intermediate and technical results are deferred to Appendix A.
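To make the pipeline described above concrete, here is a minimal sketch, not the authors' algorithms. It assumes that the mean measure $E(X)$ is estimated by the empirical mean $\frac{1}{n}\sum_{i=1}^n X_i$, that a $k$-point codebook is obtained with a plain Lloyd-type iteration on the pooled weighted points, and that each measure is then mapped to $\mathbb{R}^k$ by summing an exponential kernel around each codepoint. The kernel, the `scale` parameter, the farthest-point initialization, and all function names (`quantize_mean_measure`, `vectorize`, `make_toy_measures`) are illustrative assumptions.

```python
import numpy as np


def quantize_mean_measure(measures, k, n_iter=50, seed=0):
    """Lloyd-type quantization of the empirical mean measure (illustrative).

    measures: list of (points, weights) with points of shape (m_i, d)
    and weights of shape (m_i,). Returns a (k, d) array of codepoints.
    """
    rng = np.random.default_rng(seed)
    pts = np.vstack([p for p, _ in measures])
    # Empirical mean measure: pool all atoms, scale each weight by 1/n.
    wts = np.concatenate([w for _, w in measures]) / len(measures)

    # Farthest-point initialization (an assumption made for robustness).
    centers = [pts[rng.integers(len(pts))]]
    for _ in range(k - 1):
        d2 = np.min([((pts - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(pts[d2.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign every atom to its nearest codepoint.
        d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each codepoint to the weighted mean of its cell.
        for j in range(k):
            mask = labels == j
            mass = wts[mask].sum()
            if mass > 0:  # leave a codepoint in place if its cell is empty
                centers[j] = (wts[mask, None] * pts[mask]).sum(axis=0) / mass
    return centers


def vectorize(measure, centers, scale=1.0):
    """Map one discrete measure to R^k: kernel-weighted mass near each codepoint."""
    pts, wts = measure
    dist = np.sqrt(((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    return (wts[:, None] * np.exp(-dist / scale)).sum(axis=0)


def make_toy_measures(n_per_group=5, n_points=20, seed=0):
    """Hypothetical data: two groups of point clouds in R^2 with unit weights."""
    rng = np.random.default_rng(seed)
    measures = []
    for loc in (0.0, 5.0):
        for _ in range(n_per_group):
            pts = rng.normal(loc=loc, scale=0.3, size=(n_points, 2))
            measures.append((pts, np.ones(n_points)))
    return measures


if __name__ == "__main__":
    measures = make_toy_measures()
    centers = quantize_mean_measure(measures, k=2)
    vectors = np.array([vectorize(m, centers) for m in measures])
    print(vectors.round(2))  # one row in R^k per input measure
```

The resulting rows can be passed to any standard clustering routine for points in $\mathbb{R}^k$, which is the use the summary above has in mind for the embedding.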

Vectorization and clustering of measures

Vectorization of measures

Discriminative codebooks and clustering of measures

Quantization of the mean measure

Batch and mini-batch algorithms

Theoretical guarantees

Application: clustering persistence diagrams

Experimental results

Measure clustering

Large-scale graph classification

Text classification with word embedding

Proof of Proposition 3

Proof of Proposition 5

Proof of Theorem 9

Proof of Theorem 10

Proof of Proposition 14

Proof of Proposition 15

Proof of Corollary 16

Proofs for Section 6

Proof of Lemma 17

Proof of Lemma 18

Proof of Lemma 22

Proof of Lemma 23

Proof of Lemma 24

Proof of Lemma 12

Journal: Electronic Journal of Statistics
Publication Date: Jan 1, 2021
Citations: 1
License type: cc-by
