Towards multi-purpose main-memory storage structures: Exploiting sub-space distance equalities in totally ordered data sets for exact knn queries

Martin Schäler, Christine Tex, Veit Köppen, David Broneske, Gunter Saake

doi:10.1016/j.is.2021.101791

Abstract

Efficient knn computation for high-dimensional data is an important, yet challenging task. Today, most information systems use a column-store back-end for relational data. For such systems, multi-dimensional indexes that accelerate selections are known. However, they cannot be used to accelerate knn queries. Consequently, one relies on sequential scans or specialized knn indexes, or trades result quality for speed. To avoid storing one specialized index per query type, we envision multi-purpose indexes that efficiently compute multiple query types. In this paper, we focus on additionally supporting knn queries as a first step towards this goal. To this end, we study how to exploit total orders for accelerating knn queries based on the sub-space distance equalities observation: non-equal points in the full space that are projected to the same point in a sub space have the same distance to every other point in this sub space. If one can easily find these equalities and tune storage structures towards them, two effects can be exploited to accelerate knn queries. The first effect allows pruning of point groups based on a cascade of lower bounds. The second allows re-using previously computed sub-space distances between point groups. This results in a worst-case execution bound that is independent of the distance function. We present knn algorithms exploiting both effects and show how to tune a storage structure already known to work well for multi-dimensional selections. Our investigations reveal that the effects are robust to increasing, e.g., the dimensionality, suggesting generally good knn performance. Comparing our knn algorithms to well-known competitors reveals large performance improvements of up to one order of magnitude. Furthermore, the algorithms deliver performance at least comparable to that of the next fastest competitor, suggesting that the algorithms are only marginally affected by the curse of dimensionality.
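The two effects named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or storage structure: it assumes squared Euclidean distance, takes the leading dimensions as the sub space, and uses a hypothetical function name `knn_with_group_pruning` with a hypothetical parameter `prefix_len`. Points sharing the same leading coordinates form a group; their sub-space distance to the query is computed once per group (re-use) and serves as a lower bound on every member's full-space distance (pruning).

```python
import heapq
from collections import defaultdict


def knn_with_group_pruning(points, query, k, prefix_len=1):
    """Exact knn under squared Euclidean distance (illustrative sketch only).

    Returns (sorted list of (distance, point), number of points skipped).
    """
    # Group points that project to the same point in the sub space
    # spanned by the first `prefix_len` dimensions.
    groups = defaultdict(list)
    for p in points:
        groups[tuple(p[:prefix_len])].append(p)

    # One sub-space distance per group. It is shared by all members (re-use)
    # and lower-bounds each member's full-space distance (pruning).
    bounds = sorted(
        (sum((a - b) ** 2 for a, b in zip(prefix, query[:prefix_len])), prefix)
        for prefix in groups
    )

    best = []  # max-heap emulated with negated distances, at most k entries
    skipped = 0
    for sub_dist, prefix in bounds:
        # If the lower bound cannot beat the current k-th best distance,
        # no member of this group can improve the result.
        if len(best) == k and sub_dist >= -best[0][0]:
            skipped += len(groups[prefix])
            continue
        for p in groups[prefix]:
            rest = sum(
                (a - b) ** 2
                for a, b in zip(p[prefix_len:], query[prefix_len:])
            )
            dist = sub_dist + rest  # re-use the group's sub-space distance
            if len(best) < k:
                heapq.heappush(best, (-dist, p))
            elif dist < -best[0][0]:
                heapq.heapreplace(best, (-dist, p))

    return sorted((-d, p) for d, p in best), skipped


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (0, 5), (3, 0), (3, 4), (10, 10), (10, 0)]
    neighbours, skipped = knn_with_group_pruning(pts, (0, 0), k=2)
    print(neighbours, skipped)
```

In this toy data set, the groups with leading coordinate 3 and 10 are never expanded, because their sub-space distance alone already exceeds the current second-best full-space distance. The sketch relies on the distance being a sum of non-negative per-dimension terms; other distance functions would need their own lower-bound argument.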

Highlights

In the last decade, main-memory database systems have revolutionized analytical query processing of relational data
It is the only approach delivering, on average, a speedup compared to the baseline of sequentially scanning the entire data set, suggesting that one can generally expect good knn query performance
Towards using the same index for multiple query types, we study the concept of sub-space distance equalities, which yields two effects for efficient knn computation: the group lower bound effect and the re-use effect of sub-space distances

Summary

Introduction

Main-memory database systems have revolutionized analytical query processing of relational data. One exploits advances in hardware that reduce the cost of visiting all points by orders of magnitude compared to hard-disk environments. This makes sequential scans a powerful competitor. For realistic data dimensionality as found, e.g., in the UCI archive [8], such approaches visit a large fraction of the data. They even deteriorate to a sequential scan. If dimensionality is high, the only approach avoiding the deterioration problem is result approximation [9], i.e., trading result accuracy for improved response time. Result approximation may not be valid for all use cases.
