Multi-core CPU Implementation Research Articles

Quantitative computed tomography (QCT) of the lungs plays an increasing role in identifying sub-phenotypes of pathologies previously lumped into broad categories such as chronic obstructive pulmonary disease and asthma. Methods for image matching and linking multiple lung volumes have proven useful in linking structure to function and in the identification of regional longitudinal changes. Here, we seek to improve the accuracy of image matching via the use of a symmetric multi-level non-rigid registration employing an inverse consistent (IC) transformation whereby images are registered both in the forward and reverse directions. To develop the symmetric method, two similarity measures, the sum of squared intensity difference (SSD) and the sum of squared tissue volume difference (SSTVD), were used. The method is based on a novel generic mathematical framework to include forward and backward transformations, simultaneously, eliminating the need to compute the inverse transformation. Two implementations were used to assess the proposed method: a two-dimensional (2-D) implementation using synthetic examples with SSD, and a multi-core CPU and graphics processing unit (GPU) implementation with SSTVD for three-dimensional (3-D) human lung datasets (six normal adults studied at total lung capacity (TLC) and functional residual capacity (FRC)). Success was evaluated in terms of the IC transformation consistency serving to link TLC to FRC. 2-D registration on synthetic images, using both symmetric and non-symmetric SSD methods, and comparison of displacement fields showed that the symmetric method gave a symmetrical grid shape and reduced IC errors, with the mean values of IC errors decreased by 37%. Results for both symmetric and non-symmetric transformations of human datasets showed that the symmetric method gave better results for IC errors in all cases, with mean values of IC errors for the symmetric method lower than the non-symmetric methods using both SSD and SSTVD. The GPU version demonstrated an average of 43 times speedup and ~5.2 times speedup over the single-threaded and 12-threaded CPU versions, respectively. Run times with the GPU were as fast as 2min. The symmetric method improved the inverse consistency, aiding the use of image registration in the QCT-based evaluation of the lung.

Read full abstract

Mining complex patterns with hierarchical structures becomes more and more important to understand the underlying information in large and unstructured databases. When compared with a set-mining problem or a string-mining problem, the computation complexity to recognize a pattern with hierarchical structure, and the large associated search space, make hierarchical pattern mining (HPM) extremely expensive on conventional processor architectures. We propose a flexible, hardware-accelerated framework for mining hierarchical patterns with Apriori-based algorithms, which leads to multi-pass pruning strategies but exposes massive parallelism. Under this framework, we implemented two widely used HPM techniques, sequential pattern mining (SPM) and disjunctive rule mining (DRM) on the Automata Processor (AP), a hardware implementation of non-deterministic finite automata (NFAs). Two automaton-design strategies for matching and counting different types of hierarchical patterns, called linear design and reduction design, are proposed in this paper. To generalize automaton structure for SPM, the linear design strategy is proposed by flattening sequential patterns to plain strings to produce automaton design space and to minimize the overhead of reconfiguration. Up to 90\(\times \) and 29\(\times \) speedups are achieved by the AP-accelerated algorithm on six real-world datasets, when compared with the optimized multicore CPU and GPU GSP implementations, respectively. The proposed CPU-AP solution also outperforms the state-of-the-art PrefixSpan and SPADE algorithms on a multicore CPU by up to 452\(\times \) and 49\(\times \) speedups. The AP advantage grows further with larger datasets. For DRM, the reduction design strategy is adopted by applying reduction operation of AND, with on-chip Boolean units, on several parallel sub-structures for recognizing disjunctive items. This strategy allows implicit OR reduction on alternative items within a disjunctive item by utilizing bit-wise parallelism feature of the on-chip state units. The experiments show up to 614\(\times \) speedups of the proposed CPU-AP DRM solution over a sequential CPU algorithm on two real-world datasets. The experiments also show significant increase of CPU matching-and-counting time when increasing d-rule size or the number of alternative items. However, in practical cases, the AP solution runs hundreds of times faster in matching and counting than the CPU solution, and keeps constant processing time despite the increasing complexity of disjunctive rules.

Read full abstract

Multi-core CPU Implementation Research Articles

Articles published on Multi-core CPU Implementation

Parallel Algorithm on Multicore Processor and Graphics Processing Unit for the Optimization of Electric Vehicle Recharge Scheduling

FPGA Acceleration of Probabilistic Sentential Decision Diagrams with High-level Synthesis

Accelerating the 2+2+1 method for estimating local traveltime operators in nonlinear beamforming using GPU graphics cards

GPUTreeShap: massively parallel exact calculation of SHAP scores for tree ensembles

GpuZoo: Cost-effective estimation of gene regulatory networks using the Graphics Processing Unit.

GPU-enabled searches for periodic signals of unknown shape

FL-MISR: fast large-scale multi-image super-resolution for computed tomography based on multi-GPU acceleration

Multi‐GPU room response simulation with hardware raytracing

Multi GPU parallelization of maximum likelihood expectation maximization method for digital rock tomography data

Multi-GPU Design and Performance Evaluation of Homomorphic Encryption on GPU Clusters

Binned k-d Tree Construction for Sparse Volume Data on Multi-Core and GPU Systems.

Stabilized variational formulation of an oldroyd-B fluid flow equations on a Graphic Processing Unit (GPU) architecture

CuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs

DeepCR: Cosmic Ray Rejection with Deep Learning

DeepCR: Cosmic Ray Rejection with Deep Learning

Meshless voronoi on the GPU

Ray Tracer based rendering solution for large scale fluid rendering

A GPU-based symmetric non-rigid image registration method in human lung.

OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology

Hierarchical Pattern Mining with the Automata Processor

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Multi-core CPU Implementation Research Articles

Articles published on Multi-core CPU Implementation

Parallel Algorithm on Multicore Processor and Graphics Processing Unit for the Optimization of Electric Vehicle Recharge Scheduling

FPGA Acceleration of Probabilistic Sentential Decision Diagrams with High-level Synthesis

Accelerating the 2+2+1 method for estimating local traveltime operators in nonlinear beamforming using GPU graphics cards

GPUTreeShap: massively parallel exact calculation of SHAP scores for tree ensembles

GpuZoo: Cost-effective estimation of gene regulatory networks using the Graphics Processing Unit.

GPU-enabled searches for periodic signals of unknown shape

FL-MISR: fast large-scale multi-image super-resolution for computed tomography based on multi-GPU acceleration

Multi‐GPU room response simulation with hardware raytracing

Multi GPU parallelization of maximum likelihood expectation maximization method for digital rock tomography data

Multi-GPU Design and Performance Evaluation of Homomorphic Encryption on GPU Clusters

Binned k-d Tree Construction for Sparse Volume Data on Multi-Core and GPU Systems.

Stabilized variational formulation of an oldroyd-B fluid flow equations on a Graphic Processing Unit (GPU) architecture

CuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs

DeepCR: Cosmic Ray Rejection with Deep Learning

DeepCR: Cosmic Ray Rejection with Deep Learning

Meshless voronoi on the GPU

Ray Tracer based rendering solution for large scale fluid rendering

A GPU-based symmetric non-rigid image registration method in human lung.

OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology

Hierarchical Pattern Mining with the Automata Processor