Instance selection for big data based on locally sensitive hashing and double-voting mechanism

Junhai Zhai,Yajie Huang

doi:10.1007/s43674-022-00033-z

Abstract

The increasing data volumes impose unprecedented challenges to traditional data mining in data preprocessing, learning, and analyzing, it has attracted much attention in designing efficient compressing, indexing and searching methods recently. Inspired by locally sensitive hashing (LSH), divide-and-conquer strategy, and double-voting mechanism, we proposed an iterative instance selection algorithm, which needs to run p rounds iteratively to reduce or eliminate the unwanted bias of the optimal solution by double-voting. In each iteration, the proposed algorithm partitions the big dataset into several subsets and distributes them to different computing nodes. In each node, the instances in local data subset are transformed into Hamming space by l hash function in parallel, and each instance is assigned to one of the l hash tables by the corresponding hash code, the instances with the same hash code are put into the same bucket. And then, a proportion of instances are randomly selected from each hash bucket in each hash table, and a subset is obtained. Thus, totally l subsets are obtained, which are used for voting to select the locally optimal instance subset. The process is repeated p times to obtain p subsets. Finally, the globally optimal instance subset is obtained by voting with the p subsets. The proposed algorithm is implemented with two open source big data platforms, Hadoop and Spark, and experimentally compared with three state-of-the-art methods on testing accuracy, compression ratio, and running time. The experimental results demonstrate that the proposed algorithm provides excellent performance and outperforms three baseline methods.

Full Text