Abstract
Discovering identical or near-identical items is important in many applications, such as Web crawling, because it drastically reduces text processing costs. Simhash is a widely used technique that assigns a bit-string identity to a text such that similar texts have similar identities. In this study, a real-time solution for simhash calculation in OpenCL is presented. We also show how it can be utilized by multi-core CPUs, GPUs and FPGAs. The bottom line of our results is that the computation realized on the FPGA through OpenCL provides significant power advantages.
Highlights
Many applications can benefit greatly from an effective duplicate or near-duplicate detection algorithm
We implemented simhash, a near-duplicate detection algorithm, in OpenCL. This code is ported to CPUs, Graphics Processing Units (GPUs) and FPGAs for comparison
In this study, following convention, we report the power utilization of the GPU and FPGA including their memory power consumption, which is not the case for the multi-core CPU
Summary
Many applications can benefit greatly from an effective duplicate or near-duplicate detection algorithm. In this study, we propose a method for the first phase of simhash-based near-duplicate discovery: using OpenCL with CPUs, GPUs and FPGAs to rapidly process large numbers of documents and calculate their simhash identities. We implemented simhash, a near-duplicate detection algorithm, in OpenCL. This code is ported to CPUs, GPUs and FPGAs for comparison. Near-duplicate document detection usually includes two steps: (a) the simhash fingerprint calculation; and (b) the matching phase, which identifies pairs of near-duplicate documents. Simhash is a hashing approach that maps a text document, represented by its terms, to an identity bit-string. The basic implementation is based on a kernel that starts one parallel thread per document; each thread in turn calculates the simhash and maps each term in the document to an output array. We used a kernel written in OpenCL.
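To make the two steps concrete, the following is a minimal Python sketch of a simhash fingerprint calculation and a Hamming-distance comparison. It is an illustration only, not the OpenCL kernel from the study. The 64-bit fingerprint width, the use of MD5 as the per-term hash, the weighting by term frequency, whitespace tokenization, and all function names are assumptions made for this example. The loop in simhash_batch treats every document independently, which is the property a one-thread-per-document kernel relies on.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint width; the source does not state one


def term_hash(term: str) -> int:
    """Deterministic 64-bit hash of a term (MD5 is a stand-in choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(terms) -> int:
    """Compute the fingerprint of one document given as an iterable of terms."""
    counts = [0] * BITS
    # Assumption: each term is weighted by how often it occurs.
    for term, weight in Counter(terms).items():
        h = term_hash(term)
        for bit in range(BITS):
            # A set bit votes for 1, a clear bit votes for 0.
            counts[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_batch(documents):
    """One fingerprint per document; each document is processed independently."""
    return [simhash(doc.lower().split()) for doc in documents]
```

In the matching phase, two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold itself depends on the application and is not given in the text above.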