Abstract

This paper presents an implementation of brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large, high-dimensional data clouds. The proposed method uses Graphics Processing Units (GPUs) and is scalable across multiple levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing compared with an optimized method running on a cluster of CPUs, and bring a hitherto impossible k-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
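
The multi-level decomposition described above can be pictured as nested slicing of the query set: each cluster node owns a contiguous slice of the query points, each GPU on a node owns a sub-slice, and the per-GPU kernel parallelizes over the points of that sub-slice. The following host-side CUDA C++ fragment is a minimal sketch of this idea, not the authors' code; the names assign_queries and run_on_node are illustrative, node_rank and num_nodes are assumed to come from the cluster launcher (e.g., an MPI rank), and the per-GPU kernel launch is elided.

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstdio>

    // Level 1: split the n query points evenly across cluster nodes.
    void assign_queries(int n, int node_rank, int num_nodes,
                        int* node_begin, int* node_end) {
        int per_node = (n + num_nodes - 1) / num_nodes;
        *node_begin = node_rank * per_node;
        *node_end   = std::min(n, *node_begin + per_node);
    }

    // Level 2: split this node's slice across its local GPUs.
    void run_on_node(int node_begin, int node_end) {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        int span = node_end - node_begin;
        int per_gpu = (span + num_gpus - 1) / num_gpus;
        for (int g = 0; g < num_gpus; ++g) {
            cudaSetDevice(g);
            int begin = node_begin + g * per_gpu;
            int end   = std::min(node_end, begin + per_gpu);
            // Level 3: launch the per-GPU k-NN kernel on queries [begin, end)
            // (elided here; each thread would handle one query point).
            printf("GPU %d handles queries [%d, %d)\n", g, begin, end);
        }
    }

Because every node and every GPU works on a disjoint slice of the query points against the full reference set, the slices can be processed independently, which is what makes the scheme scale with the number of nodes and GPUs.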

Highlights

  • K-Nearest neighbor graphs have a variety of applications in bioinformatics [1,2], data mining [3], machine learning [4,5], manifold learning [6], clustering analysis [7], and pattern recognition [8]

  • The k-Nearest Neighbor Graph (k-NNG) problem is similar to the k-NN problem and a k-NNG can be built by repeatedly applying the k-NN query for every object in the input data once a convenient search indexing data structure has been built

  • In this paper we describe our parallelized brute-force k-NNG algorithm on a cluster of graphics processing units

Introduction

K-Nearest neighbor graphs have a variety of applications in bioinformatics [1,2], data mining [3], machine learning [4,5], manifold learning [6], clustering analysis [7], and pattern recognition [8]. The k-NNG problem is similar to the k-NN problem, and a k-NNG can be built by repeatedly applying the k-NN query for every object in the input data once a convenient search indexing data structure has been built. Such search data structures include kd-trees [9], BBD-trees [10], random-projection trees (rp-trees) [11], and locality-sensitive hashing [12]. These methods focus on optimizing the k-NN search, i.e., finding k-NNs for a set of query points w.r.t. a set of points with which the search data structure is built, ignoring the fact that every query point is itself a data point. These methods are generally less efficient compared with one that focuses on k-NNG construction directly.
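
To make the brute-force alternative concrete, the kernel below is a minimal CUDA sketch of the per-GPU step: one thread per query point scans all points and keeps the k smallest squared Euclidean distances. It is an illustration under simplifying assumptions (the data fits in GPU memory, k <= 32, no tiling or shared-memory optimization), not the parallelization described in this paper; knng_kernel and its parameters are hypothetical names.

    #include <cuda_runtime.h>
    #include <cfloat>

    // data:     n x dim row-major matrix of points on the GPU
    // nbr_idx:  n x k output neighbor indices
    // nbr_dist: n x k output squared distances
    __global__ void knng_kernel(const float* data, int n, int dim, int k,
                                int* nbr_idx, float* nbr_dist) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= n) return;

        // Local top-k buffers, kept sorted ascending (k assumed <= 32).
        float best_d[32];
        int   best_i[32];
        for (int j = 0; j < k; ++j) { best_d[j] = FLT_MAX; best_i[j] = -1; }

        for (int p = 0; p < n; ++p) {
            if (p == q) continue;              // a point is not its own neighbor
            float d = 0.f;
            for (int c = 0; c < dim; ++c) {
                float diff = data[q * dim + c] - data[p * dim + c];
                d += diff * diff;
            }
            if (d < best_d[k - 1]) {           // insert into the sorted top-k list
                int j = k - 1;
                while (j > 0 && best_d[j - 1] > d) {
                    best_d[j] = best_d[j - 1];
                    best_i[j] = best_i[j - 1];
                    --j;
                }
                best_d[j] = d;
                best_i[j] = p;
            }
        }
        for (int j = 0; j < k; ++j) {
            nbr_idx[q * k + j]  = best_i[j];
            nbr_dist[q * k + j] = best_d[j];
        }
    }

A launch such as knng_kernel<<<(n + 255) / 256, 256>>>(d_data, n, dim, k, d_idx, d_dist) would produce the neighbor lists for one data partition; a production implementation would tile the distance computation and distribute partitions across GPUs and cluster nodes, as outlined in the Abstract.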
