Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL

Ercan Canhasi

doi:10.3844/jcssp.2018.699.704

Abstract

Discovering identical or near-identical items is important in many applications such as Web crawling, since it drastically reduces text processing costs. Simhash is a widely used technique that assigns a bit-string identity to a text such that similar texts have similar identities. In this study, a real-time solution for simhash calculation in OpenCL is presented. We also show how it can be utilized by multi-core CPUs, GPUs and FPGAs. Our results indicate that the basic computation realized on the FPGA through OpenCL provides significant power advantages.
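To make the idea of a bit-string identity concrete, here is a minimal Python sketch of a simhash fingerprint. It is not the paper's OpenCL kernel. The 64-bit width, the MD5-based term hash, whitespace tokenization and term-frequency weights are all assumptions made for illustration, and the names `term_hash`, `simhash` and `hamming_distance` are hypothetical.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the source does not state one


def term_hash(term: str) -> int:
    """Hash a term to a 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Return a simhash fingerprint, using term frequency as the weight."""
    weights = Counter(text.lower().split())  # assumed tokenization
    votes = [0] * FINGERPRINT_BITS
    for term, weight in weights.items():
        h = term_hash(term)
        for bit in range(FINGERPRINT_BITS):
            # Each term votes for or against every bit position.
            if (h >> bit) & 1:
                votes[bit] += weight
            else:
                votes[bit] -= weight
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # a positive total sets the bit
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Texts that share most of their terms cast mostly the same votes, so their fingerprints tend to differ in only a few bits, which `hamming_distance` measures.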

Highlights

Many applications can largely benefit from an effective duplicate or near-duplicate detection algorithm
We implemented simhash, a near-duplicate detection algorithm, in OpenCL. This code is ported to CPUs, Graphics Processing Units (GPUs) and FPGAs for comparison
In this study, following convention, we report the power utilization of the GPU and FPGA including their memory power consumption, which is not the case for the multi-core CPU

Summary

Introduction

Many applications can largely benefit from an effective duplicate or near-duplicate detection algorithm. In this study we propose a method for the first phase of simhash-inspired near-duplicate discovery, using OpenCL in combination with CPUs, GPUs and FPGAs to rapidly process huge numbers of documents and calculate their simhash identities. We implemented simhash, a near-duplicate detection algorithm, in OpenCL. This code is ported to CPUs, GPUs and FPGAs for comparison. Near-duplicate document detection usually includes two steps: (a) the simhash fingerprint calculation; and (b) the matching phase for identifying pairs of near-duplicate documents. Simhash is a hashing approach which maps a text document represented by terms to an identity bit-string. The basic implementation is based on a kernel that starts one parallel thread per document; each thread in turn calculates the simhash and maps each term in the document to an output array. We used a kernel written in OpenCL.
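Step (b), the matching phase, can be sketched in Python as a pairwise comparison of fingerprints by Hamming distance. This is only an illustration, not the method used in the paper: the function names and the threshold of 3 differing bits are assumptions, and the brute-force loop is quadratic in the number of documents.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(fingerprints, max_distance=3):
    """Return index pairs whose fingerprints differ in at most max_distance bits.

    max_distance=3 is an assumed threshold, not a value taken from the source.
    """
    pairs = []
    for (i, fa), (j, fb) in combinations(enumerate(fingerprints), 2):
        if hamming_distance(fa, fb) <= max_distance:
            pairs.append((i, j))
    return pairs
```

For example, `near_duplicate_pairs([0x0, 0x1, 0xFF, 0xFE])` pairs document 0 with 1 and document 2 with 3, since each of those pairs differs in a single bit while all other pairs differ in seven or eight.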

Journal: Journal of Computer Science
Publication Date: May 1, 2018
Citations: 1
License type: cc-by
