Singleton indexes for nearest neighbor search

E. S. Tellez, G. Ruiz, E. Chavez

doi:10.1016/j.is.2016.03.003

Abstract

The nearest neighbor search problem is fundamental in computer science and, in spite of the effort of a vast number of research groups, the instances that allow an efficient solution are limited to databases of objects of small intrinsic dimension. For intrinsically high-dimensional data, the only possible solution is to compromise and use approximate or probabilistic approaches. For the instances in between, there is an overwhelmingly large number of indexes that claim good performance. However, the problem of parameter selection makes them unwieldy for use outside the research community. Even when the indexes can be tuned correctly, either the number of operations needed for index construction and tuning is prohibitively large or there are obscure parameters to tune. These restrictions force users from other fields to fall back on brute force to solve real-world instances.

In this paper, we present a family of indexing algorithms designed for end users. They require as input only the database, a query sample, and the amount of space available. Our building blocks are standard discarding rules, and the indexes add routing objects such as pivots, hyperplane references, or cluster centroids. The indexes are built incrementally and self-tune by greedily searching for a global optimum in performance.

We show experimentally that, using this oblivious strategy, our indexes outperform state-of-the-art, manually fine-tuned indexes. For example, our indexes are twice as fast as the fastest alternative (LC, EPT, or VPT) for most of our datasets. In the case of LC, the fastest alternative for high-dimensional datasets, the difference is smaller than 5%. In the same case, our indexes are at least one order of magnitude faster to build. This superior performance is maintained for large, high-dimensional datasets (100 million 12-dimensional objects). In this benchmark, our best index is two times faster than the closest alternative (VPT), six times faster than the majority of indexes, and more than sixty times faster than the sequential scan.
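To make the idea of an incrementally built, self-tuning index concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a pivot table with the standard triangle-inequality discarding rule, Euclidean distance, range queries with a fixed radius, and a space budget expressed as a maximum number of pivots (one stored distance per object per pivot). The names `build_greedy`, `search_cost`, `range_search`, and `candidates_per_round` are hypothetical. Each round tries a few random candidate pivots, keeps the one that most reduces the number of distance evaluations over the query sample, and stops when no candidate helps or the budget is exhausted.

```python
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def survives(dq, row, radius):
    """Pivot discarding rule: if |d(q,p) - d(u,p)| > r for some pivot p,
    the triangle inequality guarantees d(q,u) > r, so u can be skipped."""
    return all(abs(a - b) <= radius for a, b in zip(dq, row))


def search_cost(pivots, table, query, radius):
    """Number of distance evaluations for one range query."""
    dq = [dist(query, p) for p in pivots]  # one evaluation per pivot
    return len(pivots) + sum(1 for row in table if survives(dq, row, radius))


def range_search(db, pivots, table, query, radius):
    """Return all objects within `radius` of `query`, using the pivot table."""
    dq = [dist(query, p) for p in pivots]
    return [u for u, row in zip(db, table)
            if survives(dq, row, radius) and dist(query, u) <= radius]


def build_greedy(db, queries, radius, max_pivots, candidates_per_round=8, seed=0):
    """Add pivots one at a time while the cost on the query sample improves.

    Assumption: `max_pivots` stands in for the available space.
    """
    rng = random.Random(seed)
    pivots, table = [], [[] for _ in db]
    # With no pivots every object survives, so this equals a sequential scan.
    best = sum(search_cost(pivots, table, q, radius) for q in queries)
    while len(pivots) < max_pivots:
        round_best = None
        for cand in rng.sample(db, min(candidates_per_round, len(db))):
            trial = [row + [dist(u, cand)] for u, row in zip(db, table)]
            cost = sum(search_cost(pivots + [cand], trial, q, radius)
                       for q in queries)
            if round_best is None or cost < round_best[0]:
                round_best = (cost, cand, trial)
        if round_best[0] >= best:  # no candidate improves the sample cost
            break
        best, pivots, table = round_best[0], pivots + [round_best[1]], round_best[2]
    return pivots, table, best
```

Calling `build_greedy(db, queries, radius, max_pivots)` returns the chosen pivots, the distance table, and the total sample cost, which can be compared against `len(db) * len(queries)` for the sequential scan. The stopping rule here is a simple local criterion chosen for brevity; the abstract does not specify how the greedy search is terminated.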

Full Text