Abstract

Regardless of the type of data, traditional Bloom filters treat each element of a set as a string and, by iterating over every character of the string, discretize all data randomly and uniformly. However, as the data size and dimensionality increase, these variants become inefficient. To better discretize vectors with high numerical dimensions, this paper replaces the string hashes with integer hashes. Based on the integer hashes and a counter array, we propose a new variant, the high-dimensional Bloom filter (HDBF), which extends the Bloom filter into high-dimensional spaces and can represent and query numerical vectors of a big set with a low false positive probability. This paper theoretically analyzes the feasibility of the integer hashes for discretizing data and discusses the relationships among the parameters of the HDBF. The experiments illustrate that, in high-dimensional numerical spaces, the HDBF shows better randomness in distribution and entropy than the counting Bloom filter. Compared with parallel Bloom filters, for a fixed false positive probability, the HDBF incurs lower time-space overheads and is more suitable for dealing with numerical vectors of high dimensions.
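As a concrete illustration of the structure the abstract describes, here is a minimal Python sketch of a counter-array filter driven by integer hashes over the coordinates of a vector. The multiply-shift hash family and the per-dimension mixing constant are hypothetical stand-ins chosen for the sketch; the paper's exact hash functions are not reproduced here.

```python
import random

class HDBF:
    """Sketch of a high-dimensional Bloom filter: every coordinate of an
    integer vector feeds k integer hash functions, and the resulting
    positions increment a shared counter array (as in a counting filter).
    The hash family below is a hypothetical multiply-shift scheme, not
    the paper's exact construction."""

    def __init__(self, m, k, seed=42):
        self.m = m                               # counter array size
        self.k = k                               # hashes per coordinate
        rng = random.Random(seed)
        self.params = [(rng.randrange(1, 1 << 61), rng.randrange(1 << 61))
                       for _ in range(k)]
        self.counters = [0] * m

    def _positions(self, vector):
        # Mix each coordinate with its dimension index so equal values in
        # different dimensions hash to different positions (an assumption,
        # not taken from the paper).
        for dim, x in enumerate(vector):
            key = (dim + 1) * 0x9E3779B97F4A7C15 ^ x
            for a, b in self.params:
                yield (a * key + b) % self.m

    def insert(self, vector):
        for pos in self._positions(vector):
            self.counters[pos] += 1

    def query(self, vector):
        # May report a false positive, never a false negative.
        return all(self.counters[pos] > 0 for pos in self._positions(vector))

hdbf = HDBF(m=1 << 20, k=4)
hdbf.insert([3, 141, 59, 26])
print(hdbf.query([3, 141, 59, 26]))    # True
print(hdbf.query([2, 718, 28, 18]))    # almost certainly False
```

Because the array holds counters rather than single bits, deletions could also be supported by decrementing, which is the usual motivation for a counting layout.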

Highlights

  • In high-dimensional spaces, exact search methods, such as kd-tree approaches [1] and Q-gram [2], are only suitable for small sets of vectors because of their very large computational cost

  • Since distribution and entropy reflect the discrete state of the data, to check whether the high-dimensional Bloom filter (HDBF) can scatter high-dimensional vectors into different integers randomly and uniformly, this paper first compares the distribution and entropy of the HDBF with those of the counting Bloom filter (CBF) on 3 datasets (a small entropy-measurement sketch follows this list)

  • Under 10 K query vectors, the query times of the CBF and HDBF are lower than those of the parallel Bloom filters PBF-HT and PBF-BF; because the parallel filters must insert all dimensions into their corresponding arrays, their initialization time keeps increasing with the dimensionality, as shown in Figures 9 and 10
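The first highlight turns on comparing distribution and entropy. The sketch below shows one way such a comparison could be measured, using the Shannon entropy of hashed-position frequencies; the two position streams are stand-in random data for illustration, not the paper's datasets.

```python
import math
import random
from collections import Counter

def shannon_entropy(positions):
    """Shannon entropy (bits) of a stream of hashed positions; higher
    entropy means inputs are scattered more uniformly across the array."""
    counts = Counter(positions)
    total = len(positions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Stand-in position streams for two hashing schemes over the same inputs:
# one uniform (the behavior a good integer hash should approach) and one
# clustered around a central value.
rng = random.Random(0)
m = 1 << 16                                    # array size
uniform = [rng.randrange(m) for _ in range(100_000)]
skewed = [int(rng.gauss(m // 2, m // 64)) % m for _ in range(100_000)]

print(f"uniform: {shannon_entropy(uniform):.2f} bits")  # close to log2(m) = 16
print(f"skewed : {shannon_entropy(skewed):.2f} bits")   # noticeably lower
```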


Summary

Introduction

In high-dimensional spaces, exact search methods, such as kd-tree approaches [1] and Q-gram [2], are only suitable for small sets of vectors because of their very large computational cost. The integer-granularity locality-sensitive Bloom filter (ILBF) [32] filters objects with multiple integer distance granularities to shrink the distances and to reduce the FPP of the MLBF. All these schemes are based on LSH; according to the central limit theorem, after mapping, LSH shrinks most elements of the set toward the mean, which results in a high false positive probability (FPP) in membership queries, especially around the mean. The modified hash functions can effectively discretize vectors with high numerical dimensions and scatter them into the filters uniformly and randomly. The experiments demonstrate that the HDBF has the same performance as the CBF as regards data discretization and can efficiently deal with vectors in high-dimensional numerical spaces. The HDBF outperforms the CBF in false positive probability, query delay, and memory costs, especially in high-dimensional numerical spaces.
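The concentration effect attributed to LSH above can be illustrated numerically: an LSH-style projection sums many coordinates, so by the central limit theorem the projected values of random vectors pile up around the mean. The vectors and the projection direction below are stand-in random data for illustration only.

```python
import random
import statistics

# Project random high-dimensional vectors onto one random direction and
# measure how tightly the projected values concentrate around the mean.
# Stand-in random data; not the paper's datasets.
rng = random.Random(1)
dim = 512
vectors = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(2_000)]
w = [rng.gauss(0, 1) for _ in range(dim)]            # one random projection
proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in vectors]

mean, sd = statistics.mean(proj), statistics.stdev(proj)
share = sum(abs(p - mean) <= sd for p in proj) / len(proj)
print(f"mean={mean:.2f}, sd={sd:.2f}, within one sd: {share:.1%}")
# Roughly 68% of projections land in one narrow band around the mean,
# which is where membership queries collide and false positives cluster.
```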

Work Mechanism and Structure
Performances
Dataset and Settings
Distribution and Entropy
Entropies of CBF and HDBF
Memory Costs and Latency
Compared with PBF-HT and PBF-BF
Conclusions
