Abstract
This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold. First, we aim to approximate, with a compactly supported measure, the mean of the measure-generating process, which coincides with the intensity measure in the point process framework and with the expected persistence diagram in the framework of persistence-based topological data analysis. To this end we provide two algorithms that we prove to be almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and we investigate its properties through a clustering-oriented lens. In a nutshell, we show that for a mixture of measure-generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.
Highlights
This paper handles the case where we observe $n$ i.i.d. measures $X_1, \dots, X_n$ on $\mathbb{R}^d$, rather than $n$ i.i.d. sample points, the latter being the standard input of many machine learning algorithms.
The framework of i.i.d. sample measures encompasses the analysis of multi-channel time series, for instance in embankment dam anomaly detection from piezometers [21], as well as topological data analysis, where persistence diagrams naturally appear as discrete measures in $\mathbb{R}^2$ [8, 10].
The objective of the paper is to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves the cluster structure, if any.
Summary
This paper handles the case where we observe $n$ i.i.d. measures $X_1, \dots, X_n$ on $\mathbb{R}^d$, rather than $n$ i.i.d. sample points, the latter being the standard input of many machine learning algorithms. The objective of the paper is to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves the cluster structure, if any. Such an embedding, combined with standard clustering techniques, should result in efficient measure clustering procedures. [...] $X_n$ in practice is too costly for large $n$, even with approximating algorithms [14, 31]. To overcome these difficulties, we choose to define the central measure as the arithmetic mean of $X$, denoted by $E(X)$, that assigns the weight $E[X(A)]$ to a Borel set $A$. Section 3.3 investigates the special case where the measures are persistence diagrams built from samplings of different shapes, showing that all the previously exposed theoretical results apply in this framework. Proofs of intermediate and technical results are deferred to Appendix A.
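To make the pipeline described above concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the paper's two algorithms and not the ATOL implementation. It assumes each sample measure is stored as a pair `(points, weights)`. It takes the empirical average of the sample measures as a stand-in for $E(X)$, approximates it with $k$ centers using plain weighted Lloyd iterations, and maps each measure to $\mathbb{R}^k$ by the mass it puts in each center's Voronoi cell. The function names `pooled_mean_measure`, `lloyd_quantize` and `vectorize`, the hard Voronoi assignment, and the parameters `n_iter` and `seed` are all hypothetical choices made for this example.

```python
import numpy as np


def pooled_mean_measure(measures):
    """Empirical mean measure (1/n) * sum_i X_i, returned as (points, weights).

    Each measure is a pair (points, weights) with points of shape (m_i, d)
    and weights of shape (m_i,). This storage format is an assumption.
    """
    n = len(measures)
    points = np.vstack([p for p, _ in measures])
    weights = np.concatenate([w for _, w in measures]) / n
    return points, weights


def lloyd_quantize(points, weights, k, n_iter=50, seed=0):
    """Weighted Lloyd iterations: k centers approximating a discrete measure."""
    rng = np.random.default_rng(seed)
    # Initialize the centers at k distinct support points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every point to every center, shape (m, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            mass = weights[mask].sum()
            if mass > 0:  # an empty cell keeps its previous center
                centers[j] = (weights[mask][:, None] * points[mask]).sum(axis=0) / mass
    return centers


def vectorize(measure, centers):
    """Map one measure to R^k: the mass it puts in each center's Voronoi cell."""
    points, weights = measure
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return np.bincount(labels, weights=weights, minlength=len(centers))
```

A typical use would be to call `pooled_mean_measure` on the sample, pass the result to `lloyd_quantize` to get the centers, then apply `vectorize` to every sample measure and feed the resulting vectors in $\mathbb{R}^k$ to any standard clustering routine. The hard assignment keeps the example short; it preserves the total mass of each measure but is only one of several possible ways to turn centers into a vectorization.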