Abstract

In this paper, we study the behaviour of the so-called k-simplicial distances and k-minimal-variance distances between a point and a sample. The family of k-simplicial distances includes the Euclidean distance, the Mahalanobis distance, Oja’s simplex distance and many others. We give recommendations about the choice of parameters used to calculate the distances, including the size of the sub-sample of simplices that can be used to reduce computation time where needed. We introduce a new family of distances, which we call k-minimal-variance distances. Each of these distances is constructed using polynomials in the sample covariance matrix, with the aim of providing an alternative to the inverse covariance matrix that remains applicable when the data are degenerate. We explore some applications of the considered distances, including outlier detection and clustering, and compare how the behaviour of the distances changes with different parameter choices.

Highlights

  • The Mahalanobis distance is one of the most useful tools in multivariate data science, underpinning a huge variety of practical data analysis methods

  • We explore the choice of the parameter k, and show that relatively low values of k already produce good results, making the k-minimal-variance distance a quick and viable alternative to the Mahalanobis distance

  • We prove a theorem comparing the variance of the squared Euclidean distance, the squared Mahalanobis distance and the squared k-simplicial distance with k = 2 and δ = 2

Summary

Introduction

The Mahalanobis distance is one of the most useful tools in multivariate data science, underpinning a huge variety of practical data analysis methods. This distance measures the proximity of a point x ∈ R^d to a d-dimensional set of points X = {x1, …}. It was introduced in Mahalanobis [27]. The Mahalanobis distance corresponds to the Euclidean distance in the standardized space where variables are uncorrelated.
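The correspondence between the Mahalanobis distance and the Euclidean distance in the standardized space can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes the sample mean and the sample covariance matrix are used as the location and scatter estimates, and the function names `mahalanobis_to_sample` and `whitened_euclidean`, as well as the synthetic data, are hypothetical choices made for illustration.

```python
import numpy as np


def mahalanobis_to_sample(x, X):
    """Mahalanobis distance from a point x (shape (d,)) to a sample X (shape (N, d)).

    Assumption: location and scatter are the sample mean and sample covariance.
    """
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)  # rows of X are observations
    diff = x - mu
    # Solve S z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(S, diff)
    return float(np.sqrt(diff @ z))


def whitened_euclidean(x, X):
    """Euclidean distance from x to the sample mean after standardizing with S^(-1/2)."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Symmetric inverse square root from the eigendecomposition S = V diag(w) V^T.
    w, V = np.linalg.eigh(S)
    S_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    return float(np.linalg.norm(S_inv_sqrt @ (x - mu)))


# Synthetic, correlated 3-dimensional sample (illustrative data only).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, -0.3, 0.7]])
X = rng.normal(size=(200, 3)) @ A.T
x = np.array([1.0, -2.0, 0.5])

d_mahal = mahalanobis_to_sample(x, X)
d_white = whitened_euclidean(x, X)
print(d_mahal, d_white)  # the two values agree up to floating-point error
```

Both functions rely on the sample covariance matrix being invertible. When the data are degenerate, the linear solve fails or becomes numerically unstable, which is the situation the k-minimal-variance distances are intended to address.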

Choosing k in the k-Simplicial Distance
Numerical Computation of k-Simplicial Distances Using Sub-Sampling
Outlier Labelling Example
Efficiency of k-Minimal-Variance Distances Compared to k-Simplicial Distances
Conclusion
Findings
