Abstract

Estimating the similarity of sets of data is a common operation in computing. MinHash is widely used for this purpose: it computes a signature for each set and then compares the signatures. Signature comparison is therefore an important part of similarity estimation. To make the comparison efficient, the size of the signature components is commonly set to the word size of the processor, or to one half or one fourth of it. This enables efficient data manipulation and comparison but is not optimal in terms of storage. For example, 48-bit signature components may be more than enough in many applications, but since most processors cannot easily manipulate that size, 64-bit components are used instead. This implies a 33.3% memory overhead. This paper presents and evaluates Bitwise Signature Comparison (BSC), a method that enables the efficient comparison of signature components of any bit width. The results show that BSC achieves a speed similar to that of the traditional comparison implementation regardless of the size of the signature components. Any signature component size can therefore be used, which enables better trade-offs in the implementation of similarity estimation sketches.
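As an illustration of the conventional approach the abstract refers to, the following is a minimal Python sketch of MinHash signature generation and component-wise comparison, together with the storage arithmetic behind the 33.3% figure. It is not the BSC method, whose details are not given here. The function names (`minhash_signature`, `estimate_similarity`, `storage_overhead`), the use of salted BLAKE2b as the hash family, and the default of 128 components are all assumptions made for this example.

```python
import hashlib


def minhash_signature(items, num_hashes=128, bits=64):
    """Return a MinHash signature: `num_hashes` components of `bits` bits each.

    Assumption: each hash function is BLAKE2b with a different salt, and a
    component is narrowed to `bits` bits by masking. `items` must be a
    non-empty iterable of strings.
    """
    mask = (1 << bits) - 1
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        # Keep the minimum (masked) hash value over all items in the set.
        min_value = min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "little",
            )
            & mask
            for item in items
        )
        signature.append(min_value)
    return signature


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of equal components."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same number of components")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def storage_overhead(used_bits, needed_bits):
    """Relative extra memory when storing `needed_bits` in `used_bits` slots."""
    return used_bits / needed_bits - 1.0
```

For instance, `storage_overhead(64, 48)` evaluates to one third, which matches the 33.3% overhead quoted above for storing 48-bit components in 64-bit words. In this sketch the components are plain Python integers, so it does not model the packed memory layout or the word-level comparison speed that the paper is concerned with.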
