Single-threaded CPU Implementation Research Articles

We develop GPU adaptations of the Aho-Corasick (AC) and multipattern Boyer-Moore string matching algorithms for two cases: GPU-to-GPU (the input to the algorithms is initially in GPU memory and the output is left in GPU memory) and host-to-host (the input and output are in the memory of the host CPU). For the GPU-to-GPU case, we consider several refinements to a base GPU implementation and measure the performance gain from each refinement. For the host-to-host case, we analyze two strategies for communicating between the host and the GPU and show that one is optimal with respect to runtime while the other requires less device memory. This analysis is done for GPUs with one I/O channel to the host as well as for those with two. Experiments conducted on an NVIDIA Tesla GT200 GPU with 240 cores, hosted by a 2.8 GHz quad-core Xeon CPU, show that, for the GPU-to-GPU case, our AC GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-threaded CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation. For the host-to-host case, the GPU AC code achieves a speedup of 3.1 relative to a single-threaded CPU implementation. However, the GPU is unable to deliver any speedup relative to the best multithreaded code running on the quad-core host; the measured speedups for that comparison ranged between 0.74 and 0.83, meaning the GPU code was slower. Early versions of our multipattern Boyer-Moore adaptations ran 7 to 10 percent slower than the corresponding versions of the AC adaptations, so we did not refine the multipattern Boyer-Moore codes further.
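For readers unfamiliar with the baseline being compared against, the following is a minimal single-threaded sketch of Aho-Corasick multipattern matching. It is a textbook-style illustration, not the authors' CPU or GPU code; the function names `build_automaton` and `search` and the list-of-dicts state layout are assumptions made for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links, and output sets (illustrative layout)."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state of the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first traversal: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

The scan visits each input character once regardless of how many patterns are searched for, which is the property that makes the algorithm a natural candidate for splitting the input across many GPU threads.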

A suffix array (SA) is a data structure formed by sorting the suffixes of a string into lexicographic order. SAs have been used in a variety of applications, most notably in pattern matching and in lossless data compression based on the Burrows-Wheeler Transform (BWT). SAs have also become the data structure of choice for many, if not all, string processing problems to which suffix tree methodology is applicable. Over the last two decades, researchers have proposed many suffix array construction algorithms (SACAs). We conduct a systematic study of the main classes of SACAs with the intent of mapping them onto a data-parallel architecture such as the GPU. We conclude that the skew algorithm [12], a linear-time recursive algorithm, is the best candidate for GPUs because all of its phases can be mapped efficiently to data-parallel hardware. Our OpenCL implementation of the skew algorithm achieves a throughput of up to 25 MStrings/sec and a speedup of up to 34x on a discrete GPU and 5.8x on an APU over a single-threaded CPU implementation. We also compare our OpenCL implementation against the fastest known CPU implementation, which is based on induced copying, and achieve a speedup of up to 3.7x. Using the SA, we construct the BWT on the GPU and achieve a speedup of 11x over the fastest known BWT on a GPU. Suffix arrays are often augmented with longest common prefix (LCP) information. We design a novel high-performance parallel algorithm for computing the LCP on the GPU. Our GPU implementation of LCP achieves a speedup of up to 25x on a discrete GPU and 4.3x on an APU.
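The three structures named in this abstract can be illustrated with a small sequential sketch. This is not the skew algorithm or the authors' parallel LCP method: the suffix array here is built by naive sorting, and the LCP array uses the well-known sequential Kasai approach. The function names are assumptions for this example, and the input is assumed to end with a unique sentinel character (here `$`) that sorts before every other character.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix itself."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT: the character preceding each suffix, taken in sorted suffix order.

    Assumes text ends with a unique sentinel; for the suffix starting at 0,
    text[-1] wraps around to that sentinel.
    """
    return "".join(text[i - 1] for i in sa)


def lcp_kasai(text, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


text = "banana$"
sa = suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
lcp = lcp_kasai(text, sa)
print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(bwt)  # annb$aa
print(lcp)  # [0, 0, 1, 3, 0, 0, 2]
```

The naive sort compares whole suffixes and is therefore far slower than linear-time constructions on long inputs; it is used here only because it makes the definition of the suffix array directly visible.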

Articles published on Single-threaded CPU Implementation

GPU-to-GPU and Host-to-Host Multipattern String Matching on a GPU

Highly-Parallel GPU Architecture for Lossy Hyperspectral Image Compression

Parallel suffix array and least common prefix for the GPU

Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

Cuda Parallel Implementation of Image Reconstruction Algorithm for Positron Emission Tomography

GPGPU implementation of growing neural gas: Application to 3D scene reconstruction

Simulating cortical networks on heterogeneous multi-GPU systems

Ultrafast convolution/superposition using tabulated and exponential kernels on GPU

GPU accelerated Monte Carlo simulations of lattice spin models

Graphics Hardware Accelerated Continuous Collision Detection Between Deformable Objects

Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS

On developing B-spline registration algorithms for multi-core processors
