Abstract

Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that are particularly often among the nearest neighbors of other objects. Conversely, other data objects become antihubs, which are rarely or never nearest neighbors to other objects. Many machine learning algorithms rely on nearest neighbor search and some form of measuring distances, which are both impaired by high hubness. Degraded performance due to hubness has been reported for various tasks such as classification, clustering, regression, visualization, recommendation, retrieval and outlier detection. Several hubness reduction methods based on different paradigms have previously been developed. Local and global scaling as well as shared neighbors approaches aim at repairing asymmetric neighborhood relations. Global and localized centering try to eliminate spatial centrality, while the related global and local dissimilarity measures are based on density gradient flattening. Additional methods and alternative dissimilarity measures that were argued to mitigate detrimental effects of distance concentration also influence the related hubness phenomenon. In this paper, we present a large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures. We investigate several aspects of hubness reduction as well as its influence on data semantics which we measure via nearest neighbor classification. Scaling and density gradient flattening methods improve evaluation measures such as hubness and classification accuracy consistently for data sets from a wide range of domains, while centering approaches achieve the same only under specific settings.
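
To make the scaling paradigm named above concrete, the following is a minimal sketch (not the implementation evaluated in the paper) of local scaling applied to a precomputed distance matrix. It uses the commonly cited form LS(x, y) = 1 - exp(-d(x, y)^2 / (sigma_x * sigma_y)), where sigma_x is the distance from x to its k-th nearest neighbor; the function name, the choice k = 10, and the toy data are illustrative assumptions.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def local_scaling(D, k=10):
        """Hedged sketch of local scaling: rescale a symmetric distance matrix D
        so that neighborhood relations become more symmetric, which tends to
        reduce hubness. sigma_i is the distance from object i to its k-th
        nearest neighbor (self-distances excluded)."""
        D = np.asarray(D, dtype=float).copy()
        np.fill_diagonal(D, np.inf)                  # ignore self-distances for kNN
        sigma = np.sort(D, axis=1)[:, k - 1]         # k-th nearest neighbor distance
        np.fill_diagonal(D, 0.0)
        D_ls = 1.0 - np.exp(-D**2 / (sigma[:, np.newaxis] * sigma[np.newaxis, :]))
        np.fill_diagonal(D_ls, 0.0)                  # keep zero self-distances
        return D_ls

    # Toy example: 500 points from a 300-dimensional Gaussian (a high-hubness regime)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 300))
    D = squareform(pdist(X))                         # Euclidean distance matrix
    D_ls = local_scaling(D, k=10)                    # secondary distances in [0, 1)

On such data, the rescaled distances D_ls typically exhibit markedly lower hubness than D (e.g., measured as the skewness of the k-occurrence distribution introduced below).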

Highlights

  • Learning in high-dimensional spaces is often challenging due to various phenomena that are commonly referred to as the curse of dimensionality [4]

  • Since intraclass distances should generally be smaller than interclass distances, nearest neighbor classification accuracy can be used as a proxy to measure semantic correctness [21]

  • We present a comprehensive empirical evaluation of unsupervised hubness reduction methods that showed promising results in previous studies, examining both their ability to reduce hubness and whether they preserve the semantics of the data spaces

Introduction

Learning in high-dimensional spaces is often challenging due to various phenomena that are commonly referred to as the curse of dimensionality [4]. Hubness is one such phenomenon: in high-dimensional spaces, some objects lie closer to the global centroid (unimodal data) or to a local centroid (multimodal data) than others [44]. These objects often emerge as hubs with high k-occurrence, that is, they are among the k-nearest neighbors of many other objects. Nearest neighbor relations in high-hubness regimes are prone to semantic incorrectness: hubs propagate their encoded information too widely in the corresponding distance spaces, while information carried by antihubs is essentially lost [57]. Such distance spaces fail to reflect class information, that is, the semantic meaning of the data. Since intraclass distances should generally be smaller than interclass distances, nearest neighbor classification accuracy can be used as a proxy to measure semantic correctness [21].
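
As a concrete illustration of the quantities used in this paragraph, the sketch below (an illustrative toy, not the paper's evaluation code) computes the k-occurrence O_k of each object, hubness as the skewness of the O_k distribution, and leave-one-out k-nearest-neighbor classification accuracy as the semantic proxy. D is assumed to be a precomputed symmetric distance matrix and y a vector of non-negative integer class labels; the neighborhood sizes are illustrative choices.

    import numpy as np
    from scipy.stats import skew

    def k_occurrence(D, k=10):
        """O_k(x): number of times each object appears among the k nearest
        neighbors of the other objects (self-neighborhoods excluded)."""
        D = np.asarray(D, dtype=float).copy()
        np.fill_diagonal(D, np.inf)
        knn = np.argsort(D, axis=1)[:, :k]           # indices of k nearest neighbors
        return np.bincount(knn.ravel(), minlength=D.shape[0])

    def hubness(D, k=10):
        """Skewness of the k-occurrence distribution; large positive values
        indicate strong hubness (a few hubs, many antihubs)."""
        return skew(k_occurrence(D, k))

    def loo_knn_accuracy(D, y, k=5):
        """Leave-one-out k-NN accuracy, used as a proxy for how well the
        distance space reflects class semantics."""
        D = np.asarray(D, dtype=float).copy()
        y = np.asarray(y)
        np.fill_diagonal(D, np.inf)                  # leave-one-out: exclude self
        knn = np.argsort(D, axis=1)[:, :k]
        predictions = np.array([np.argmax(np.bincount(y[nn])) for nn in knn])
        return float(np.mean(predictions == y))

Comparing hubness(D, k) and loo_knn_accuracy(D, y, k) before and after a transform such as the local scaling sketch above is the kind of paired evaluation, at small scale, that the paper carries out across many data sets and methods.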
