Performance Analysis of Sketching Methods

Dinesh Maharjan

doi:10.3126/nccsrj.v2i1.60082

Abstract

High-dimensional data is a common problem in data mining and machine learning. MinHash can efficiently generate sketches of sparse datasets, reducing the dimension to a few thousand. These sketches can then be used in different machine-learning applications. The sketch of a data point is the vector of its k hash values. Vanilla MinHash takes O(kd) computations to generate k hash values for a data point with d non-zeros. Weighted Minwise Hashing is another method, which generates a sketch of size k in O(kp/d) computations, where p is the size of the universal set. Optimal Densification is the most efficient and accurate of the three methods, as it can generate k hash values in only O(d + k) computations. In this paper, we investigate the performance of Optimal Densification, Weighted Minwise Hashing, and Vanilla MinHash through two experiments. First, we investigate the accuracy of Jaccard similarity estimation on six synthetic datasets. Then, we perform one-nearest-neighbor (1NN) classification on four real datasets. Optimal Densification outperforms both Weighted Minwise Hashing and Vanilla MinHash in terms of accuracy and the time taken to generate the sketches.
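To make the O(kd) cost of Vanilla MinHash concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (minhash_sketch, estimate_jaccard, exact_jaccard) are hypothetical, and a salted BLAKE2b hash stands in for the k random permutations. Each of the k sketch entries is the minimum hash over the d non-zero indices, and the fraction of matching entries between two sketches estimates their Jaccard similarity.

```python
import hashlib


def _hash(index, seed):
    # Assumption: a salted BLAKE2b hash plays the role of one random
    # permutation of the universal set; the seed selects the permutation.
    digest = hashlib.blake2b(
        index.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(nonzeros, k):
    """Return a sketch of k hash values for a set of non-zero indices.

    Every one of the k entries scans all d non-zeros, hence O(kd) work.
    """
    items = list(nonzeros)
    return [min(_hash(x, seed) for x in items) for seed in range(k)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching entries."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    set_a, set_b = set(set_a), set(set_b)
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, with A = set(range(0, 100)) and B = set(range(50, 150)), exact_jaccard(A, B) is 1/3, and estimate_jaccard(minhash_sketch(A, 512), minhash_sketch(B, 512)) should land close to that value, with the error shrinking as k grows.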


More From: National College of Computer Studies Research Journal
