Abstract
Regardless of the type of data, traditional Bloom filters treat each element of a set as a string and, by iterating over every character of the string, discretize all data randomly and uniformly. However, as the data size and dimensionality increase, these variants become inefficient. To better discretize vectors with high numerical dimensions, this paper replaces the string hashes with integer hashes. Based on the integer hashes and a counter array, we propose a new variant, the high-dimensional Bloom filter (HDBF), to extend the Bloom filter into high-dimensional spaces, which can represent and query numerical vectors of a big set with a low false positive probability. This paper theoretically analyzes the feasibility of the integer hashes for discretizing data and discusses the relationships among the parameters of the HDBF. The experiments illustrate that, in high-dimensional numerical spaces, the HDBF shows better randomness in distribution and entropy than the counting Bloom filter. Compared with the parallel Bloom filters, for a fixed false positive probability, the HDBF displays lower time-space overheads and is more suitable for dealing with numerical vectors of high dimensions.
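To make the construction concrete, below is a minimal sketch in Python of an HDBF-like structure as the abstract describes it: a counter array shared by k hash functions, where each hash folds the numeric components of a vector directly instead of iterating over the characters of a string encoding. The class name, the multiplicative mixing constants, and the assumption that components are integer-valued are illustrative assumptions, not taken from the paper.

```python
import random

class HighDimensionalBloomFilter:
    """Sketch of an HDBF-style counting filter over numerical vectors."""

    def __init__(self, m, k, seed=0):
        self.m = m                   # number of counter cells
        self.counters = [0] * m      # counter array, as in a counting BF
        rng = random.Random(seed)
        # one random odd multiplier per hash function
        self.multipliers = [rng.randrange(1, 2 ** 61, 2) for _ in range(k)]

    def _indexes(self, vector):
        # integer hash: fold the numeric components directly, with no
        # intermediate string encoding or per-character iteration
        for a in self.multipliers:
            h = 0
            for x in vector:
                h = (h * a + int(x)) % (2 ** 61 - 1)
            yield h % self.m

    def insert(self, vector):
        for i in self._indexes(vector):
            self.counters[i] += 1

    def query(self, vector):
        # true only if every counter is set; false positives are possible
        return all(self.counters[i] > 0 for i in self._indexes(vector))

    def delete(self, vector):
        # counters (rather than single bits) make deletion safe
        for i in self._indexes(vector):
            if self.counters[i] > 0:
                self.counters[i] -= 1

# usage sketch
hdbf = HighDimensionalBloomFilter(m=1 << 20, k=4)
hdbf.insert([3, 7, 42, 9])
print(hdbf.query([3, 7, 42, 9]))   # True
print(hdbf.query([3, 7, 42, 8]))   # False (with high probability)
```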
Highlights
In high-dimensional spaces, exact search methods, such as kd-tree approaches [1] and Q-gram [2], are only suitable for small vector sets because of their very large computational resource requirements
Since distribution and entropy reflect the discrete state of data, to check whether the high-dimensional Bloom filter (HDBF) can scatter high-dimensional vectors into different integers randomly and uniformly, this paper first compares the distribution and entropy of the HDBF with those of the counting Bloom filter (CBF) on three datasets (a sketch of one such measurement follows these highlights)
Under 10 K query vectors, the query times of the CBF and the HDBF are lower than those of the parallel Bloom filters (PBF-HT and PBF-BF); because the parallel filters must insert all dimensions into their corresponding arrays, their initialization time continues to increase, as shown in Figures 9 and 10
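The distribution/entropy comparison in the first highlight can be read as Shannon entropy over the counter array's hit distribution; the exact metric used in the paper is not quoted here, so the following Python sketch is only one plausible way to measure it.

```python
import math

def counter_entropy(counters):
    """Shannon entropy (bits) of the hit distribution over counter cells.

    A perfectly uniform scatter of insertions maximizes the entropy at
    log2(len(counters)); clustering lowers it."""
    total = sum(counters)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counters if c > 0)

# a uniform scatter vs. a skewed one over 8 cells
print(counter_entropy([5, 5, 5, 5, 5, 5, 5, 5]))   # 3.0 bits (maximal)
print(counter_entropy([35, 1, 1, 1, 1, 1, 0, 0]))  # about 0.83 bits
```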
Summary
In high-dimensional spaces, exact search methods, such as kd-tree approaches [1] and Q-gram [2], are only suitable for small vector sets because of their very large computational resource requirements. The integer-granularity locality-sensitive Bloom filter (ILBF) [32] filters objects at multiple integer distance granularities to shrink the distances and to reduce the FPP of the MLBF. All these schemes are based on LSH; according to the central limit theorem, after mapping, the LSH concentrates most elements of the set around the mean, which results in a high FPP for membership queries, especially near the mean. The modified hash functions can effectively discretize vectors with high numerical dimensions uniformly and randomly. The experiments demonstrate that the HDBF has the same discretization performance as the CBF and can efficiently handle vectors in high-dimensional numerical spaces, and that the HDBF outperforms the CBF in false positive probability, query delay, and memory cost, especially in high-dimensional numerical spaces
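The central-limit argument in the summary (an LSH projection is a sum of many terms, so its values cluster near their mean, while an integer hash that mixes components multiplicatively scatters them) can be illustrated with a toy comparison. The projection scheme, constants, and data distribution below are assumptions chosen for demonstration only, not the paper's experimental setup.

```python
import math
import random
from collections import Counter

random.seed(1)
DIM, N, W = 64, 10_000, 4.0
A = [random.gauss(0.0, 1.0) for _ in range(DIM)]  # fixed random projection
B = random.uniform(0.0, W)

def lsh_bucket(v):
    # p-stable LSH bucket: floor((a . v + b) / w); the dot product is a sum
    # of many terms, so by the central limit theorem it clusters near its mean
    return math.floor((sum(x * a for x, a in zip(v, A)) + B) / W)

def int_hash_bucket(v, buckets=1024):
    # multiplicative integer hash folding components directly (illustrative
    # stand-in for the paper's integer hashes; the constant is arbitrary)
    h = 0
    for x in v:
        h = (h * 0x9E3779B97F4A7C15 + int(x)) % (2 ** 64)
    return h % buckets

vectors = [[random.randint(0, 9) for _ in range(DIM)] for _ in range(N)]
for name, bucket in (("p-stable LSH", lsh_bucket), ("integer hash", int_hash_bucket)):
    loads = Counter(bucket(v) for v in vectors)
    top5 = sum(count for _, count in loads.most_common(5))
    print(f"{name}: {len(loads)} buckets used, "
          f"top-5 buckets hold {100 * top5 / N:.1f}% of vectors")
```

On synthetic data like this, the handful of LSH buckets around the mean absorb a large share of the vectors, while the integer hash spreads them almost evenly over the bucket range, which is the discretization property the HDBF relies on.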