Graph Based K-Nearest Neighbor Search Revisited
The problem of k-nearest neighbor (k-NN) search is the fundamental problem of finding the exact k nearest neighbor points for a user-given query point q in a d-dimensional dataset D of n points; the approximate k-NN (k-ANN) search problem finds approximate answers instead. Both are extensively studied to support real applications. Among all approaches, graph-based approaches have been shown to be the best at supporting k-NN/ANN in recent studies. The state-of-the-art graph-based approach, τ-MG, finds the 1-NN, \(\bar{p}_1\), over a graph index G τ constructed for D based on a predetermined parameter τ, where the distance between \(\bar{p}_1\) and q is less than τ, and finds k-ANN based on the approach taken for 1-NN. τ-MG and other graph-based approaches have two main issues. First, it is difficult to predetermine a τ that both guarantees finding the 1-NN and does so efficiently, because accuracy and efficiency are both tied to the size of the constructed graph index G τ: higher accuracy comes at the expense of efficiency. Second, like all other existing graph-based approaches, it has no theoretical guarantee of finding the k-NN, because it uses the same graph index, G τ, for both 1-NN and k-NN (k > 1). In this article, we propose a new graph-based approach for k-NN with a theoretical guarantee. We construct a labeled graph, \(\mathcal {G}\), and do not need to predetermine τ. Instead, we find the 1-NN over a subgraph, \(\mathcal {G}_{\dot{\tau }}\), of \(\mathcal {G}\), virtually constructed in a dynamic manner. Here, the \(\dot{\tau }\) we use is query-dependent and can be smaller than τ, and the subgraph \(\mathcal {G}_{\dot{\tau }}\) is smaller than G τ when \(\dot{\tau }= \tau\). We find the k-NN in two phases. In the navigation phase, we find the 1-NN, \(\bar{p}_1\), of q over \(\mathcal {G}_{\dot{\tau }}\).
In the refinement phase, for k > 1, we explore the neighbors within the vicinity region of \(\bar{p}_1\) in \(\mathcal {G}\). Building on our theoretical solution for k-NN, we propose new algorithms to support k-ANN efficiently in practice. We conduct extensive performance studies that confirm the effectiveness and efficiency of our new approach.
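Under the usual graph-search assumptions, the two-phase scheme described above (greedy navigation to an approximate 1-NN, then exploration of its vicinity) can be sketched as follows. This is an illustrative simplification with invented names and a fixed vicinity budget, not the paper's labeled-graph algorithm.

```python
import heapq
import math

def greedy_1nn(graph, points, query, start):
    """Navigation phase: walk the graph greedily toward the query
    until no neighbor of the current node is closer."""
    cur = start
    cur_d = math.dist(points[cur], query)
    improved = True
    while improved:
        improved = False
        for nb in graph[cur]:
            d = math.dist(points[nb], query)
            if d < cur_d:
                cur, cur_d, improved = nb, d, True
    return cur

def knn_two_phase(graph, points, query, start, k):
    """Refinement phase: expand a bounded vicinity around the 1-NN
    found by navigation, keeping the k closest points seen."""
    p1 = greedy_1nn(graph, points, query, start)
    seen, frontier, candidates = {p1}, [p1], []
    while frontier:
        node = frontier.pop()
        heapq.heappush(candidates, (math.dist(points[node], query), node))
        for nb in graph[node]:
            if nb not in seen and len(seen) < 4 * k:  # vicinity budget
                seen.add(nb)
                frontier.append(nb)
    return [heapq.heappop(candidates)[1] for _ in range(min(k, len(candidates)))]
```

On a small path graph, the navigation phase lands on the closest node and the refinement phase collects the k nearest around it.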
- Research Article
- 10.3844/jcssp.2012.1358.1363
- Aug 1, 2012
- Journal of Computer Science
Problem statement: A database optimized to store and query data related to objects in space, including points, lines, and polygons, is called a spatial database. Identifying the nearest neighbor object is a vital part of spatial databases. Many nearest neighbor search techniques are available, such as Authenticated Multi-step NN (AMNN), Superseding Nearest Neighbor (SNN) search, and Bayesian Nearest Neighbor (BNN), but they have difficulties performing NN search in uncertain spatial databases. AMNN does not process queries from distributed servers and accesses queries only from a single server; SNN cannot use high-dimensional data structures in NN search and handles only low-dimensional data. Approach: Previous work described NN search using SNN with Marginal Object Weight (MOW) ranking. Its downside is poorer performance compared with work that performed NN search using BNN. To improve NN search in spatial databases using BNN, we present a new technique: BNN search using marginal object weight ranking. Based on events occurring in the nearest object, BNN starts its search using MOW, which computes the weight of each NN object and ranks each object by its frequency and distance for efficient NN search in spatial databases. Results: MOW is applied to all nearest neighbor objects identified using BNN for any relevant query point, and queries from distributed servers are processed using MOW. Conclusion: The proposed BNN-with-MOW framework is evaluated on real data sets, showing improvement over the previous SNN-with-MOW approach in execution time, memory consumption, and query result accuracy.
- Conference Article
- 10.1109/iccvw.2009.5457540
- Sep 1, 2009
Nearest Neighbor (NN) search plays an important role in Computer Vision algorithms. In particular, NN search over the immensely large image datasets stored on the Internet is gaining attention. For such huge data, the main memory of a single PC is insufficient. As a solution, we propose an approximate NN search on a hard disk drive (HDD) in this paper. The algorithm is based on the recently proposed Principal Component Hashing (PCH). In our algorithm, "PCH on HDD" (PCHD), the hash bins are represented by the leaf nodes of a B+ tree to handle dynamic addition and deletion of data. The search time is, of course, slower than the original PCH. However, experiments using a standard PC and 10000 stored images revealed several advantages of this approach: 1) the memory consumption is 42 times smaller; 2) the first search, including cold start-up time, is 4.3 times faster (PCH: 31.8 s, PCHD: 7.4 s); 3) interestingly, successive searches are accelerated by the cache mechanism embedded in the operating system (mean search time decreases from 7.4 s to 0.64 s). We also confirmed that our algorithm performs NN search on a 1 million image dataset with only 193 MB of memory, which PCH cannot do because of its huge memory consumption. These properties make the algorithm suitable for non-time-critical NN search applications and for NN search engines called by web servers, where the engine starts up in response to occasional queries.
- Research Article
- 10.1109/tip.2021.3066907
- Jan 1, 2021
- IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Hashing methods have been widely used in Approximate Nearest Neighbor (ANN) search for big data due to their low storage requirements and high search efficiency. These methods usually map ANN search for big data into the k-Nearest Neighbor (kNN) search problem in Hamming space. However, Hamming distance calculation ignores bit-level distinctions, leading to confusing rankings. To further increase search accuracy, various bit-level weights have been proposed to rank hash codes in weighted Hamming space. Nevertheless, existing ranking methods in weighted Hamming space are almost all based on exhaustive linear scan, which is time consuming and unsuitable for large datasets. Although Multi-Index Hashing, a sub-linear search method, has been proposed, it relies on Hamming distance rather than weighted Hamming distance. To address this issue, we propose an exact kNN search approach with Multiple Tables in Weighted Hamming space, named WHMT, in which the distribution of bit-level weights is incorporated into the multi-index building. With WHMT, we can obtain the optimal candidate set for exact kNN search in weighted Hamming space without exhaustive linear scan. Experimental results show that WHMT achieves a dramatic speedup of up to 69.8 times over the linear scan baseline without losing accuracy in weighted Hamming space.
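As a minimal illustration of the ranking problem this paper targets, the sketch below computes a weighted Hamming distance and ranks codes by exhaustive linear scan, the baseline WHMT is designed to avoid. The function names and weights are illustrative, not the paper's API.

```python
def weighted_hamming(code_a, code_b, weights):
    """Weighted Hamming distance: sum the bit-level weights of the
    positions where two binary codes differ, instead of counting them."""
    diff = code_a ^ code_b
    return sum(w for i, w in enumerate(weights) if (diff >> i) & 1)

def rank_by_weighted_hamming(query, codes, weights, k):
    """Exhaustive linear-scan ranking: sort all codes by their weighted
    Hamming distance to the query and return the k closest indices."""
    return sorted(range(len(codes)),
                  key=lambda i: weighted_hamming(query, codes[i], weights))[:k]
```

Note how a single high-weight bit flip (weight 2.0) can outrank two low-weight flips, which is exactly the distinction plain Hamming distance loses.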
- Research Article
- 10.1145/1806907.1806912
- Jul 1, 2010
- ACM Transactions on Database Systems
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs two properties: (i) it can be easily incorporated into a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms an LSB-forest that has strong quality guarantees, while dramatically improving the efficiency of the previous LSH implementation with the same guarantees. In practice, the LSB-tree itself is also an effective index that consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than (i) iDistance (a famous technique for exact NN search) by two orders of magnitude and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, while returning much better results. As a second step, we extend our LSB technique to another classic problem in high-dimensional space, Closest Pair (CP) search. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which most existing solutions fail to do. We show that, using an LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, while still ensuring very good quality.
In practice, accurate answers can be found using just two LSB-trees, giving a substantial reduction in space and running time. In our experiments, our technique was faster than (i) distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude and (ii) D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, while outputting better results.
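The random-hyperplane flavor of LSH gives a minimal sense of how hashing makes NN search sublinear: the query is compared only against its own bucket instead of the whole dataset. This sketch is illustrative only and does not reproduce the LSB-tree; the function names and toy dataset are our own.

```python
import random
import math

def make_hyperplane_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit records which side of a random
    hyperplane a point falls on; nearby points tend to share bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(vec):
        return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                     for plane in planes)
    return h

def lsh_query(table, h, query, points):
    """Scan only the query's bucket and return its closest member."""
    bucket = table.get(h(query), [])
    return min(bucket, key=lambda i: math.dist(points[i], query), default=None)

# index a toy dataset: bucket id -> list of point indices
points = [(1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
h = make_hyperplane_hash(dim=2, n_bits=4, seed=1)
table = {}
for i, p in enumerate(points):
    table.setdefault(h(p), []).append(i)
```

Collisions are probabilistic, so real systems (including LSB-trees) combine multiple tables to recover the quality guarantee.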
- Research Article
- 10.1007/s00530-014-0444-3
- Dec 24, 2014
- Multimedia Systems
Nearest neighbor (NN) search in high-dimensional space plays a fundamental role in large-scale image retrieval. It requires efficient indexing and search techniques, both of which are simultaneously essential for similarity search and semantic analysis. However, in recent years there have been few breakthroughs, and the performance of current NN search techniques is far from satisfactory, especially for exact NN search. A recently proposed method, HB, addresses exact NN search efficiently in high-dimensional space. It benefits from cluster-based techniques, which can generate a more compact representation of the data set than other techniques by exploiting interdimensional correlations. However, HB suffers from huge cost for lower bound computations and provides no further pruning scheme for points in candidate clusters. In this paper, we extend the HB method to address exact NN search in correlated, high-dimensional vector data sets extracted from large-scale image databases by introducing two new pruning/selection techniques, and we call the result HB+. The first technique selects more quickly the subset of hyperplanes/clusters that must be considered. The second prunes irrelevant points in the selected subset of clusters. Experiments show the improvement of HB+ over HB in terms of efficiency (I/O cost and CPU response time) and also demonstrate its superiority over other exact NN indexes.
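The lower-bound pruning idea behind cluster-based exact NN search (of which HB/HB+ are refined variants) can be sketched as follows. The centroid-minus-radius bound and all names here are a simplified assumption, not HB+'s actual hyperplane-based scheme.

```python
import math

def exact_nn_with_cluster_pruning(clusters, query):
    """Visit clusters in order of a lower bound on the distance to any
    of their points; stop once the bound exceeds the best distance
    found, so far-away clusters are never scanned."""
    # lower bound for a ball cluster: distance to centroid minus radius
    order = []
    for centroid, radius, members in clusters:
        lb = max(0.0, math.dist(centroid, query) - radius)
        order.append((lb, members))
    order.sort(key=lambda t: t[0])
    best, best_d = None, float('inf')
    for lb, members in order:
        if lb >= best_d:
            break  # no cluster after this one can hold a closer point
        for p in members:
            d = math.dist(p, query)
            if d < best_d:
                best, best_d = p, d
    return best
```

The result is exact because the bound is a true lower bound; HB+'s contribution is making these bounds cheaper to compute and adding point-level pruning inside candidate clusters.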
- Research Article
- 10.1109/tmm.2021.3073811
- Mar 12, 2021
- IEEE Transactions on Multimedia
Nearest neighbor search and k-nearest neighbor graph construction are two fundamental issues that arise in many disciplines, such as multimedia information retrieval, data mining, and machine learning. They have become increasingly pressing given the big data emerging in various fields in recent years. In this paper, a simple but effective solution for both approximate k-nearest neighbor search and approximate k-nearest neighbor graph construction is presented, with the two issues addressed jointly. On one hand, approximate k-nearest neighbor graph construction is treated as a search task: each sample, along with its k nearest neighbors, is joined into the k-nearest neighbor graph by performing nearest neighbor search sequentially on the graph under construction. On the other hand, the built k-nearest neighbor graph is used to support k-nearest neighbor search. Since the graph is built online, dynamic updates to the graph, which most existing solutions do not support, are possible. The solution is feasible for various distance measures.
Its effectiveness, both for k-nearest neighbor graph construction and for k-nearest neighbor search, is verified across different types of data at different scales, in various dimensions, and under different metrics.
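The joint build-and-search loop described above can be sketched roughly as follows: each new sample is located by a best-first search on the partially built graph, then linked to the neighbors it finds. The exploration budget, tie-breaking, and function names are our simplifications, not the authors' algorithm.

```python
import math

def search(graph, points, q, k, start=0):
    """Best-first search on the current graph for q's k nearest members."""
    visited, frontier, found = {start}, [start], []
    while frontier:
        frontier.sort(key=lambda i: math.dist(points[i], q))
        cur = frontier.pop(0)
        found.append(cur)
        if len(found) >= 3 * k:          # small exploration budget
            break
        for nb in graph[cur]:
            if nb not in visited:
                visited.add(nb)
                frontier.append(nb)
    found.sort(key=lambda i: math.dist(points[i], q))
    return found[:k]

def build_knn_graph(points, k):
    """Online construction: insert each sample by searching the graph
    built so far, then link it to the neighbors it found."""
    graph = {0: []}
    for i in range(1, len(points)):
        nbrs = search(graph, points, points[i], k)
        graph[i] = list(nbrs)
        for n in nbrs:                   # make edges undirected
            graph[n].append(i)
    return graph
```

Because insertion reuses the search primitive, the same structure supports both queries and dynamic updates, which is the paper's central point.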
- Research Article
- 10.5194/isprs-archives-xlii-2-w1-69-2016
- Oct 26, 2016
- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Abstract. Nearest Neighbour (NN) search is one of the important queries and analyses for spatial applications. In normal practice, a spatial access method structure is used during Nearest Neighbour query execution to retrieve information from the database. However, most spatial access method structures still face unresolved issues such as overlap among nodes and repetitive data entries. This leads to excessive Input/Output (IO) operations, which is inefficient for data retrieval, and the situation becomes more crucial when dealing with 3D data, which is usually large due to its detailed geometry and other attached information. In this research, a clustered 3D hierarchical structure is introduced as a 3D spatial access method structure, expected to improve the retrieval of Nearest Neighbour information for 3D objects. Several tests are performed on single Nearest Neighbour search and k Nearest Neighbour (kNN) search. The tests indicate that the clustered hierarchical structure handles Nearest Neighbour queries more efficiently than its competitor: it reduces repetitive data entries and the number of accessed pages, produces minimal Input/Output operations, and outperforms the competitor in query response time. Several possible applications are discussed and summarized as a future outlook for this research.
- Research Article
- 10.2747/1548-1603.44.2.149
- Jun 1, 2007
- GIScience & Remote Sensing
The K nearest neighbor (KNN) method of image analysis is practical, relatively easy to implement, and is becoming one of the most popular methods for conducting forest inventory using remote sensing data. KNN is often called the K nearest neighbor classifier when used to classify categorical variables, and K nearest neighbor regression when applied to predict noncategorical variables. As an instance-based estimation method, KNN has two problems: the selection of K values and computation cost. We address the problem of K selection by applying a new approach, the combination of the Kolmogorov-Smirnov (KS) test and the cumulative distribution function (CDF), to determine the optimal K. Our research indicates that the KS test and CDF are much more efficient for selecting K than cross-validation and bootstrapping, which are commonly used today. We use remote sensing data reduction techniques, such as principal components analysis, layer combination, and computation of a vegetation index, to save computation cost. We also consider the theoretical and practical implications of different K values in forest inventory.
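As context for the K-selection problem, a minimal leave-one-out cross-validation selector (the common baseline the KS/CDF approach is compared against) might look like the sketch below; the data and names are illustrative, and the KS-based selection itself is not reproduced here.

```python
import math

def knn_predict(train, labels, x, k):
    """Plain kNN regression: average the labels of the k closest samples."""
    idx = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))[:k]
    return sum(labels[i] for i in idx) / k

def select_k_by_loocv(train, labels, candidates):
    """Leave-one-out cross-validation: pick the K with the smallest
    total squared prediction error over held-out samples."""
    best_k, best_err = None, float('inf')
    for k in candidates:
        err = 0.0
        for i in range(len(train)):
            rest = [p for j, p in enumerate(train) if j != i]
            rest_y = [y for j, y in enumerate(labels) if j != i]
            err += (knn_predict(rest, rest_y, train[i], k) - labels[i]) ** 2
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```

The cost of this baseline, one full pass per candidate K and per sample, is exactly what motivates a cheaper distribution-based selector.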
- Research Article
- 10.14569/ijacsa.2020.0111131
- Jan 1, 2020
- International Journal of Advanced Computer Science and Applications
On image sharing websites, images are associated with tags. These tags play a very important role in an image retrieval system, so it is necessary to recommend accurate tags for the images. It is also important to design an effective classifier that assigns images to various semantic categories, a necessary step toward tag recommendation. The performance of existing tag recommendation methods based on k nearest neighbors can be affected by the number of neighbors k, the distance measure, majority voting irrespective of class, and outliers present among the k neighbors. To increase classification accuracy and overcome these issues, the Harmonic Mean based Weighted Nearest Neighbor (HM-WNN) classifier is proposed for image classification. Given an input image, HM-WNN determines the k nearest neighbors from each category for color and texture features separately over the entire training set. Weights are assigned to the closest neighbors from each category so that reliable neighbors contribute more to classification accuracy. Finally, the categorical harmonic means of the k nearest neighbors are computed, and the input image is classified into the category with the minimum mean. Experiments on a self-generated dataset show that HM-WNN achieves 88.01% accuracy, outperforming existing k nearest neighbor methods.
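A drastically simplified, hypothetical rendering of the harmonic-mean idea (not the authors' HM-WNN, which also weights neighbors and uses separate color and texture features) could score each category by the harmonic mean of its k nearest distances to the query:

```python
import math

def hm_classify(train, query, k):
    """Hypothetical sketch: for each category, take its k smallest
    distances to the query and pick the category whose harmonic-mean
    distance is smallest. A single far outlier barely lowers the
    harmonic mean, so it cannot dominate the vote."""
    best_cat, best_score = None, float('inf')
    for cat, samples in train.items():
        dists = sorted(math.dist(s, query) for s in samples)[:k]
        hm = len(dists) / sum(1.0 / max(d, 1e-12) for d in dists)
        if hm < best_score:
            best_cat, best_score = cat, hm
    return best_cat
```

Scoring per category (rather than pooling all neighbors) also avoids the class-blind majority voting the abstract criticizes.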
- Research Article
- 10.1080/02664763.2024.2414357
- Oct 12, 2024
- Journal of Applied Statistics
Missing data is a common problem in many domains that rely on data analysis. The k Nearest Neighbors imputation method has been widely used to address this issue, but it has limitations in accurately imputing missing values, especially for datasets with small pairwise correlations and small values of k. In this study, we propose Ranked k Nearest Neighbors imputation, which takes an approach similar to k Nearest Neighbors but uses the concept of ranked set sampling to select the most relevant neighbors for imputation. Our results show that the proposed method outperforms the standard k nearest neighbors method in imputation accuracy under both the Missing Completely at Random and Missing at Random mechanisms, as demonstrated by consistently lower MSIE and MAIE values across all datasets. This suggests that the proposed method is a promising alternative for imputing missing values in datasets with small pairwise correlations and small values of k. The Ranked k Nearest Neighbors method thus has important implications for data imputation in various domains and can contribute to more efficient and accurate imputation methods without adding any computational complexity to the algorithm.
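For reference, a bare-bones version of the standard kNN imputation baseline that the ranked variant builds on might look like this: each missing value is filled with the mean of that column over the k rows closest on the columns observed in both. All names are illustrative.

```python
import math

def knn_impute(rows, k):
    """Standard kNN imputation: for each missing cell, find the k donor
    rows closest on the commonly observed columns and fill the cell
    with the mean of their values in that column."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                def dist(other):
                    # distance over columns observed in both rows
                    common = [(a, b) for a, b in zip(row, other)
                              if a is not None and b is not None]
                    return math.sqrt(sum((a - b) ** 2 for a, b in common))
                donors = [r for r in rows if r is not row and r[j] is not None]
                donors.sort(key=dist)
                filled[i][j] = sum(r[j] for r in donors[:k]) / k
    return filled
```

The ranked variant changes only the donor-selection step; the fill rule stays the same, which is why it adds no computational complexity.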
- Conference Article
- 10.1109/sisap.2009.33
- Aug 1, 2009
Retrieving the k-nearest neighbors of a query object is a basic primitive in similarity searching. A related, far less explored primitive is to obtain the dataset elements which would have the query object within their own k-nearest neighbors, known as the reverse k-nearest neighbor query. We already have indices and algorithms to solve k-nearest neighbor queries in general metric spaces; yet, in many cases of practical interest they degenerate to sequential scanning. The naive algorithm for reverse k-nearest neighbor queries has quadratic complexity, because the k-nearest neighbors of all the dataset objects must be found; this is too expensive. Hence, when solving these primitives we can tolerate trading correctness in the solution for searching time. In this paper we propose an efficient approximate approach to solve these similarity queries with high retrieval rate. Then, we show how to use our approximate k-nearest neighbor queries to construct (an approximation of) the k-nearest neighbor graph when we have a fixed dataset. Finally, combining both primitives we show how to dynamically maintain the approximate k-nearest neighbor graph of the objects currently stored within the metric dataset, that is, considering both object insertions and deletions.
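The quadratic naive reverse-kNN baseline the paper improves on is easy to state: q is a reverse neighbor of every point that counts q among its own k nearest. The toy sketch below uses brute-force kNN and illustrative names.

```python
import math

def knn(points, i, k):
    """Brute-force k nearest neighbors of point i (by index)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[j], points[i]))
    return others[:k]

def reverse_knn(points, q_idx, k):
    """Naive reverse kNN: check every point's own kNN list for q.
    This computes a kNN query per point, hence quadratic overall,
    which is the cost the paper's approximate approach avoids."""
    return [i for i in range(len(points))
            if i != q_idx and q_idx in knn(points, i, k)]
```

Note the asymmetry the primitive captures: a point far from q can still be close to q's side of the space without having q in its own list, so reverse kNN cannot be read off a single kNN query.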
- Research Article
- 10.33899/iqjoss.2020.167392
- Dec 1, 2020
- IRAQI JOURNAL OF STATISTICAL SCIENCES
Thalassemia is considered a chronic disease, affecting especially children from the first years of life, and patients progress through stages over long periods. Data were collected for patients by real age and by bone age, and a comparison was made between the two cases. Among the many statistical methods used for classifying data, the nearest neighbor method was adopted as a classification method between populations: each observation is classified according to its three closest values, on the basis of which it is placed into the correct group. Because the data were rather close together, a method that helps reach a better classification was needed, and k nearest neighbor is well suited to reaching an optimal classification for such data. Using k nearest neighbor classification, classification by real age was better than classification by bone age.
- Research Article
- 10.1007/s10618-005-0030-6
- May 12, 2006
- Data Mining and Knowledge Discovery
Nearest Neighbor search is an important and widely used technique in a number of important application domains. In many of these domains, the dimensionality of the data representation is often very high. Recent theoretical results have shown that the concept of proximity or nearest neighbors may not be very meaningful for the high dimensional case. Therefore, it is often a complex problem to find good quality nearest neighbors in such data sets. Furthermore, it is also difficult to judge the value and relevance of the returned results. In fact, it is hard for any fully automated system to satisfy a user about the quality of the nearest neighbors found unless he is directly involved in the process. This is especially the case for high dimensional data in which the meaningfulness of the nearest neighbors found is questionable. In this paper, we address the complex problem of high dimensional nearest neighbor search from the user perspective by designing a system which uses effective cooperation between the human and the computer. The system provides the user with visual representations of carefully chosen subspaces of the data in order to repeatedly elicit his preferences about the data patterns which are most closely related to the query point. These preferences are used in order to determine and quantify the meaningfulness of the nearest neighbors. Our system is not only able to find and quantify the meaningfulness of the nearest neighbors, but is also able to diagnose situations in which the nearest neighbors found are truly not meaningful.
- Conference Article
- 10.1109/icde.2002.994777
- Aug 7, 2002
- Conference Article
- 10.1109/iecon.2013.6699516
- Nov 1, 2013
Energy efficiency has become more relevant recently. This includes the construction of energy efficient buildings in terms of heat conservation and dissipation. For analysing energy efficiency, several mapping algorithms have been proposed that map indoor environments with added thermal information, and several algorithms that generate virtual 3D models have recently been presented. Nearest neighbour (NN) search techniques are one of the main components of these algorithms, and several algorithms enable NN search. In this paper we present an assessment of R-tree based NN queries for the problem of scalar field mapping, which maps measured temperatures onto a reconstructed 3D mesh of an indoor environment. The mesh is reconstructed from a point cloud recorded with a 3D laser scanner and a thermal imaging camera. We present a performance analysis of R-tree based NN search with different R-tree types, as well as the quality of the scalar field mapping produced with the employed R-tree based NN search techniques.