Abstract

This paper introduces scikit-hubness, a Python package for efficient nearest neighbor search in high-dimensional spaces. Hubness is an aspect of the curse of dimensionality, and is known to impair various learning tasks, including classification, clustering, and visualization. scikit-hubness provides algorithms for hubness analysis ("Is my data affected by hubness?"), hubness reduction ("How can we improve neighbor retrieval in high dimensions?"), and approximate neighbor search ("Does it work for large data sets?"). It is integrated into the scikit-learn environment, enabling rapid adoption by Python-based machine learning researchers and practitioners. Users will find all functionality of the scikit-learn neighbors package, plus additional support for transparent hubness reduction and approximate nearest neighbor search. scikit-hubness is developed using several quality assessment tools and principles, such as PEP8 compliance, unit tests with high code coverage, continuous integration on all major platforms (Linux, macOS, Windows), and additional checks by LGTM. The source code is available at https://github.com/VarIr/scikit-hubness under the BSD 3-clause license. Install from the Python package index with $ pip install scikit-hubness.

Highlights

  • scikit-hubness is a Python package for efficient nearest neighbor search in high-dimensional spaces

  • Hubness is an aspect of the curse of dimensionality in nearest neighbor graphs

  • Approximate neighbor search (“Does it work for large data sets?”): Several methods are currently available, including locality-sensitive hashing (Aumüller, Christiani, Pagh, & Vesterli, 2019) and hierarchical navigable small-world graphs (Malkov & Yashunin, 2018)


Summary


scikit-hubness is a Python package for efficient nearest neighbor search in high-dimensional spaces. Hubness is an aspect of the curse of dimensionality in nearest neighbor graphs. It describes the increasing occurrence of hubs and antihubs with growing data dimensionality: hubs are objects that appear unexpectedly often among the nearest neighbors of other objects, while antihubs are never retrieved as neighbors. These semantically distorted graphs can reduce learning performance in various tasks, such as classification (Radovanović, Nanopoulos, & Ivanović, 2010), clustering (Schnitzer & Flexer, 2015), or visualization (Flexer, 2015).
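The notions of hubs and antihubs can be made concrete with a small, self-contained sketch. It does not use the scikit-hubness API; the function names (k_occurrence, skewness, hubness_report) and the choice of Gaussian random data are assumptions made purely for illustration. The sketch counts how often each point appears in the k-nearest-neighbor lists of the other points (its k-occurrence) and reports the skewness of that distribution, one commonly used indicator of hubness, together with the number of antihubs.

```python
import random


def k_occurrence(points, k):
    """Count how often each point appears among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Squared Euclidean distance from point i to every other point.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        ]
        dists.sort()
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Skewness (third standardized moment) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def hubness_report(n_points, dim, k, seed=0):
    """Draw standard normal data (an illustrative assumption) and summarize its hubness."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    counts = k_occurrence(points, k)
    return {
        "k_skewness": skewness(counts),               # larger values suggest stronger hubness
        "antihubs": sum(1 for c in counts if c == 0),  # points never retrieved as a neighbor
        "max_k_occurrence": max(counts),               # size of the strongest hub
        "counts": counts,
    }


if __name__ == "__main__":
    # Compare a low-dimensional and a higher-dimensional data set.
    for dim in (3, 50):
        report = hubness_report(n_points=100, dim=dim, k=5)
        print(dim, round(report["k_skewness"], 2),
              report["antihubs"], report["max_k_occurrence"])
```

Running the script prints the k-occurrence skewness, the antihub count, and the largest k-occurrence for each dimensionality, which allows a quick side-by-side comparison. The brute-force neighbor search is quadratic in the number of points and is intended only for small illustrative data sets.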
