Choice Of Distance Measure Research Articles

The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used.

In this work we make a somewhat surprising claim. There is an invariance that the community seems to have missed: complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by "suggesting" to the clustering algorithm that subjectively similar, but complex objects belong in a sparser and larger diameter cluster than is truly warranted.

We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangular inequality, thus making use of most existing indexing and data mining algorithms. We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
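
The abstract above does not spell out the measure itself. As a minimal sketch, assuming the formulation usually given for CID (Euclidean distance multiplied by a correction factor, where the factor is the ratio of the larger to the smaller complexity estimate, and complexity is estimated as the root of the summed squared successive differences), the idea might look like this. The function names and the zero-complexity guard are illustrative assumptions, not taken from the article.

```python
import numpy as np


def complexity_estimate(series):
    """Root of the summed squared differences between consecutive points.

    Intuitively, the length of the series if it were "stretched out":
    jagged series score higher than smooth ones.
    """
    s = np.asarray(series, dtype=float)
    return float(np.sqrt(np.sum(np.diff(s) ** 2)))


def cid(q, c, eps=1e-12):
    """Sketch of a complexity-invariant distance between equal-length series.

    eps is an assumed guard against division by zero when one series
    is constant; it is not part of the published description.
    """
    q = np.asarray(q, dtype=float)
    c = np.asarray(c, dtype=float)
    if q.shape != c.shape:
        raise ValueError("series must have the same length")

    ed = float(np.linalg.norm(q - c))  # plain Euclidean distance
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    hi, lo = max(ce_q, ce_c), min(ce_q, ce_c)

    if hi == 0.0:
        # Both series are constant: no complexity difference to correct for.
        return ed
    factor = hi / max(lo, eps)  # always >= 1
    return ed * factor


if __name__ == "__main__":
    smooth = [0, 1, 2, 3, 4]
    jagged = [0, 2, 0, 2, 0]
    print("Euclidean:", float(np.linalg.norm(np.subtract(smooth, jagged))))
    print("CID sketch:", cid(smooth, jagged))
```

Because the correction factor in this sketch is never below 1, the Euclidean distance never exceeds the value returned by `cid`, which is consistent with the abstract's remark that the measure can be lower bounded.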

Summary: Contamination of a sampled distribution, for example by a heavy-tailed distribution, can degrade the performance of a statistical estimator. We suggest a general approach to alleviating this problem, using a version of the weighted bootstrap. The idea is to ‘tilt’ away from the contaminated distribution by a given (but arbitrary) amount, in a direction that minimizes a measure of the new distribution's dispersion. This theoretical proposal has a simple empirical version, which results in each data value being assigned a weight according to an assessment of its influence on dispersion. Importantly, distance can be measured directly in terms of the likely level of contamination, without reference to an empirical measure of scale. This makes the procedure particularly attractive for use in multivariate problems. It has several forms, depending on the definitions taken for dispersion and for distance between distributions. Examples of dispersion measures include variance and generalizations based on high order moments. Practicable measures of the distance between distributions may be based on power divergence, which includes Hellinger and Kullback–Leibler distances. The resulting location estimator has a smooth, redescending influence curve and appears to avoid computational difficulties that are typically associated with redescending estimators. Its breakdown point can be located at any desired value ε ∈ (0, ½) simply by ‘trimming’ to a known distance (depending only on ε and the choice of distance measure) from the empirical distribution. The estimator has an affine equivariant multivariate form. Further, the general method is applicable to a range of statistical problems, including regression.
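
To make the tilting idea concrete, here is a rough sketch under several assumptions that are not in the abstract: dispersion is taken to be the weighted variance, distance is the Kullback–Leibler divergence from the uniform (empirical) weights, and the amount of tilt is controlled indirectly through a multiplier `lam` rather than a prescribed distance. Under those assumptions, the stationarity condition of the Lagrangian gives weights proportional to exp(-lam * (x_i - m)^2), where m is the weighted mean, and the sketch solves this by fixed-point iteration. All names are hypothetical, and matching a target distance would need an outer root-finding loop over `lam`.

```python
import numpy as np


def tilt_weights(x, lam, n_iter=200, tol=1e-12):
    """Exponentially tilted weights that down-weight points far from the centre.

    Solves p_i proportional to exp(-lam * (x_i - m)^2), with m the weighted
    mean under p, by fixed-point iteration started at the median (an assumed
    starting point). Returns (weights, weighted mean).
    """
    x = np.asarray(x, dtype=float)
    m = float(np.median(x))
    for _ in range(n_iter):
        logw = -lam * (x - m) ** 2
        w = np.exp(logw - logw.max())  # subtract max for numerical stability
        p = w / w.sum()
        m_new = float(p @ x)
        converged = abs(m_new - m) < tol
        m = m_new
        if converged:
            break
    return p, m


def kl_from_uniform(p):
    """Kullback-Leibler divergence of weights p from uniform weights 1/n."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[nz] * np.log(p.size * p[nz])))


if __name__ == "__main__":
    data = np.array([0.1, -0.3, 0.2, 0.0, -0.1, 0.4, -0.2, 0.3, 10.0])
    weights, centre = tilt_weights(data, lam=0.5)
    print("raw mean:", data.mean())
    print("tilted mean:", centre)
    print("weight on the outlier:", weights[-1])
    print("KL distance from uniform:", kl_from_uniform(weights))
```

In this toy example the single large value receives a weight close to zero, so the weighted mean sits near the bulk of the data, which mirrors the redescending behaviour the abstract describes.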

Articles published on Choice Of Distance Measure

CID: an efficient complexity-invariant distance for time series

Supervised Distance Matrices

INFLUENCE OF SPATIOTEMPORAL SCALE ON THE INTERPRETATION OF PALEOCOMMUNITY STRUCTURE: LATERAL VARIATION IN THE IMPERIAL FORMATION OF CALIFORNIA

Reconstructing community relationships: the impact of sampling error, ordination approach, and gradient length

Preliminary comparison between microsatellite and AFLP multilocus genotypes for bovine breed assignment

Data structures and data transformations for clustering chemical data

Biased Bootstrap Methods for Reducing the Effects of Contamination
