Abstract

This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold: first, we intend to approximate with a compactly supported measure the mean of the measure-generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove to be almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map, which sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that for a mixture of measure-generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.

Highlights

  • This paper handles the case where we observe n i.i.d. measures X1, . . . , Xn on $\mathbb{R}^d$, rather than n i.i.d. sample points, the latter case being the standard input of many machine learning algorithms

  • The framework of i.i.d. sample measures encompasses analysis of multi-channel time series, for instance in embankment dam anomaly detection from piezometers [21], as well as topological data analysis, where persistence diagrams naturally appear as discrete measures in $\mathbb{R}^2$ [8, 10]

  • The objective of the paper is the following: we want to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves cluster structure, if any

Summary

Introduction

This paper handles the case where we observe n i.i.d. measures X1, . . . , Xn on $\mathbb{R}^d$, rather than n i.i.d. sample points, the latter case being the standard input of many machine learning algorithms. The objective of the paper is the following: we want to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves cluster structure, if any. Such an embedding, combined with standard clustering techniques, should result in efficient measure clustering procedures. Computing a barycenter of X1, . . . , Xn is in practice too costly for large n, even with approximating algorithms [14, 31]. To overcome these difficulties, we choose to define the central measure as the arithmetic mean of X, denoted by E(X), which assigns the weight E[X(A)] to a Borel set A. Section 3.3 investigates the special case where the measures are persistence diagrams built from samplings of different shapes, showing that all the previously exposed theoretical results apply in this framework. Proofs of intermediate and technical results are deferred to Appendix A.
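The pipeline sketched above, estimating the mean measure E(X) by a k-point quantization and using the resulting codebook to embed each measure into $\mathbb{R}^k$, can be illustrated with a minimal toy script. This is only a simplified stand-in, not the paper's algorithms: it pools all sample points and runs plain Lloyd iterations in place of the batch/mini-batch quantization schemes, and it vectorizes a measure by the mass it puts in each Voronoi cell of the codebook, a cruder map than the ATOL vectorization. The data generator and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 20 i.i.d. discrete measures on R^2, each a point set,
# drawn from two distinct generating processes (two "clusters").
def sample_measure(center, m=50):
    return center + rng.normal(scale=0.3, size=(m, 2))

measures = [sample_measure(np.array([0.0, 0.0])) for _ in range(10)] + \
           [sample_measure(np.array([3.0, 3.0])) for _ in range(10)]

# Step 1: quantize the empirical mean measure with k codepoints.
# Pooling the points of all measures and running Lloyd's algorithm is a
# crude proxy for quantizing E(X).
def lloyd(points, k, n_iter=30):
    codebook = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None] - codebook[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = points[labels == j].mean(axis=0)
    return codebook

pooled = np.vstack(measures)
codebook = lloyd(pooled, k=8)

# Step 2: embed each measure into R^k: coordinate j is the (normalized)
# mass the measure assigns to the Voronoi cell of codepoint j.
def vectorize(X, codebook):
    dists = np.linalg.norm(X[:, None] - codebook[None], axis=2)
    return np.bincount(dists.argmin(axis=1), minlength=len(codebook)) / len(X)

V = np.array([vectorize(X, codebook) for X in measures])
print(V.shape)  # one R^8 vector per measure: (20, 8)
```

Any standard clustering method (e.g. k-means with 2 centers) applied to the rows of `V` should then recover the two generating processes, which is the "embedding + standard clustering" strategy the introduction describes.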

Vectorization and clustering of measures
Vectorization of measures
Discriminative codebooks and clustering of measures
Quantization of the mean measure
Batch and mini-batch algorithms
Theoretical guarantees
Application: clustering persistence diagrams
Experimental results
Measure clustering
Large-scale graph classification
Text classification with word embedding
Proof of Proposition 3
Proof of Proposition 5
Proof of Theorem 9
Proof of Theorem 10
Proof of Proposition 14
Proof of Proposition 15
Proof of Corollary 16
Proofs for Section 6
Proof of Lemma 17
Proof of Lemma 18
Proof of Lemma 22
Proof of Lemma 23
Proof of Lemma 24
Proof of Lemma 12