Abstract
One of the recurring problems in data mining and machine learning is dealing with high-dimensional datasets. MinHash can efficiently generate sketches of sparse datasets, reducing the dimension to a few thousand, and these sketches can then be used in a variety of machine-learning applications. Vanilla MinHash takes O(kd) computations to generate k hash values for a data point with d non-zeros; the sketch of a data point is the vector of these k hash values. Weighted Minwise Hashing is another method that generates a sketch of size k in O(kp/d) computations, where p is the size of the universal set. Optimal Densification is the most efficient and accurate method, as it can generate k hash values in only O(d + k) computations. In this paper, we investigate the performance of Optimal Densification, Weighted Minwise Hashing, and Vanilla MinHash through two experiments. First, we evaluate Jaccard similarity estimation accuracy on six different synthetic datasets. Second, we perform one-nearest-neighbor (1NN) classification on four real datasets. Optimal Densification outperforms both Weighted Minwise Hashing and Vanilla MinHash in terms of accuracy and the time taken to generate the sketches.
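As a concrete illustration of the sketching idea summarized above, the following is a minimal Python sketch of vanilla MinHash: it generates k hash values per data point in O(kd) time and estimates Jaccard similarity as the fraction of matching sketch coordinates. The function names and the linear hash family are illustrative assumptions, not the implementation evaluated in the paper.

```python
import random

def minhash_sketch(nonzero_indices, k, seed=0):
    """Vanilla MinHash: k hash values per data point, O(k*d) for d non-zeros.

    Uses k random linear hash functions h_i(x) = (a_i*x + b_i) mod p
    (an illustrative choice; the paper does not prescribe a hash family).
    """
    p = (1 << 61) - 1  # a large Mersenne prime
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    # For each hash function, keep the minimum hash value over the non-zeros.
    return [min((a * x + b) % p for x in nonzero_indices) for a, b in params]

def estimate_jaccard(sketch_a, sketch_b):
    """Jaccard estimate: fraction of coordinates where the two sketches agree."""
    k = len(sketch_a)
    return sum(ha == hb for ha, hb in zip(sketch_a, sketch_b)) / k

# Usage: two sparse data points represented by their sets of non-zero indices.
A = {1, 3, 5, 7, 9, 11}
B = {1, 3, 5, 8, 10, 12}
sa, sb = minhash_sketch(A, k=512), minhash_sketch(B, k=512)
print(estimate_jaccard(sa, sb))  # approximates |A ∩ B| / |A ∪ B| = 3/9 ≈ 0.33
```

Both sketches must be built with the same seed so that the same k hash functions are applied to each data point; only then is the collision rate an estimator of the Jaccard similarity.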