Abstract

Discovering identical or near-identical items is important in many applications, such as Web crawling, because it drastically reduces text processing costs. Simhash is a widely used technique that assigns a bit-string identity to a text such that similar texts have similar identities. In this study, a real-time solution for simhash calculation in OpenCL is presented. We also show how it can be used on multi-core CPUs, GPUs and FPGAs. Our results indicate that the basic computation, realized on the FPGA through OpenCL, provides significant power advantages.

Highlights

  • Many applications can benefit greatly from an effective duplicate or near-duplicate detection algorithm

  • We implemented simhash, a near-duplicate detection algorithm, in OpenCL. The code was ported to CPUs, Graphics Processing Units (GPUs) and FPGAs for comparison

  • Following convention, the power utilization we report for the GPU and FPGA takes their memory power consumption into account, which is not the case for the multi-core CPU

Summary

Introduction

Many applications can benefit greatly from an effective duplicate or near-duplicate detection algorithm. In this study we propose a method for the first phase of simhash-inspired near-duplicate discovery: using OpenCL in combination with CPUs, GPUs and FPGAs to rapidly process huge numbers of documents and calculate their simhash identities. We implemented simhash, a near-duplicate detection algorithm, in OpenCL and ported the code to CPUs, GPUs and FPGAs for comparison. Near-duplicate document detection usually includes two steps: (a) the simhash fingerprint calculation; and (b) the matching phase, which identifies pairs of near-duplicate documents. Simhash is a hashing approach that maps a text document, represented by its terms, to an identity bit string. The basic implementation is based on a kernel, written in OpenCL, that starts one parallel thread per document; each thread in turn calculates the simhash and maps each term in the document to an output array.
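
The two steps above can be illustrated with a minimal sequential sketch in Python. It is not the OpenCL kernel from the paper. The 64-bit fingerprint width, the MD5-based term hash, the unweighted per-term votes, and the names term_hash, simhash and hamming_distance are all assumptions made for illustration, since the summary does not specify them. The body of simhash corresponds roughly to the work that one thread per document would perform.

```python
import hashlib


def term_hash(term: str, bits: int = 64) -> int:
    """Hash one term to a `bits`-wide integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(terms, bits: int = 64) -> int:
    """Step (a): compute the simhash fingerprint of one document.

    Every term votes +1 on each bit position where its hash has a 1
    and -1 where it has a 0; the fingerprint keeps the bits whose
    vote total is positive.
    """
    votes = [0] * bits
    for term in terms:
        h = term_hash(term, bits)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i, total in enumerate(votes):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Step (b) building block: number of differing fingerprint bits."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "web crawling benefits from near duplicate detection".split()
    doc_b = "web crawling benefits from near duplicate discovery".split()
    fp_a, fp_b = simhash(doc_a), simhash(doc_b)
    print(f"{fp_a:016x} {fp_b:016x} distance={hamming_distance(fp_a, fp_b)}")
```

In a matching phase, two documents would typically be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold value is application-dependent and is not given in this summary.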

