IBM Cell Processor Research Articles

The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan”, for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary
Program title: Swan
Catalogue identifier: AEIH_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU Public License version 2
No. of lines in distributed program, including test data, etc.: 17 736
No. of bytes in distributed program, including test data, etc.: 131 177
Distribution format: tar.gz
Programming language: C
Computer: PC
Operating system: Linux
RAM: 256 Mbytes
Classification: 6.5
External routines: NVIDIA CUDA, OpenCL
Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore x86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.
Solution method: Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.
Restrictions: No support for CUDA C++ features
Running time: Nominal
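
To give a feel for the kind of kernel-level rewriting that a CUDA-to-OpenCL translation involves, the following is a minimal toy sketch in Python. It is not Swan's implementation, and the names `translate_kernel`, `CUDA_TO_OPENCL`, `INDEX_BUILTINS` and `AXES` are hypothetical. It only renames a handful of CUDA keywords and built-in index variables to their standard OpenCL counterparts; a real translator must also handle address-space qualifiers on pointer arguments (for example `__global float *`), textures, and host-side launch code, all of which are skipped here.

```python
import re

# Hand-picked subset of CUDA spellings and their OpenCL C equivalents.
# Assumption: plain textual substitution is enough for this illustration.
CUDA_TO_OPENCL = [
    (r"__global__\s+void", "__kernel void"),
    (r"__shared__", "__local"),
    (r"__syncthreads\(\)", "barrier(CLK_LOCAL_MEM_FENCE)"),
]

# CUDA built-in index variables and the OpenCL work-item functions
# that return the same quantity.
INDEX_BUILTINS = {
    "threadIdx": "get_local_id",
    "blockIdx": "get_group_id",
    "blockDim": "get_local_size",
    "gridDim": "get_num_groups",
}
AXES = {"x": 0, "y": 1, "z": 2}


def translate_kernel(cuda_src: str) -> str:
    """Rename CUDA keywords and index built-ins to OpenCL spellings."""
    out = cuda_src
    for pattern, replacement in CUDA_TO_OPENCL:
        out = re.sub(pattern, replacement, out)

    def index_repl(match):
        # e.g. blockIdx.x -> get_group_id(0)
        func = INDEX_BUILTINS[match.group(1)]
        return f"{func}({AXES[match.group(2)]})"

    return re.sub(
        r"\b(threadIdx|blockIdx|blockDim|gridDim)\.([xyz])\b", index_repl, out
    )
```

Calling `translate_kernel` on a kernel body containing `blockIdx.x * blockDim.x + threadIdx.x` yields the same expression written with `get_group_id(0)`, `get_local_size(0)` and `get_local_id(0)`.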

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. 

Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
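
As an illustration of what an explicit time step on a structured grid looks like, the sketch below advances the 1D shallow water equations by one step with the first-order Lax-Friedrichs scheme on a periodic grid. This is a deliberately simpler stand-in, not the high-resolution 2D method used by SWsolver, and the names `flux`, `lax_friedrichs_step` and `G` are hypothetical. The point it tries to show is that each new cell value depends only on old values in neighbouring cells, which is the property that lets such updates be distributed over data-parallel hardware.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def flux(h, hu):
    """Physical flux of the 1D shallow water equations for state (h, hu)."""
    u = hu / h
    return hu, hu * u + 0.5 * G * h * h


def lax_friedrichs_step(h, hu, dx, dt):
    """One explicit Lax-Friedrichs step on a periodic 1D grid.

    h  : list of water depths per cell
    hu : list of momenta (depth times velocity) per cell
    Returns new lists; the inputs are left unchanged.
    """
    n = len(h)
    f = [flux(h[i], hu[i]) for i in range(n)]
    new_h, new_hu = [0.0] * n, [0.0] * n
    c = dt / (2.0 * dx)
    for i in range(n):  # every cell update is independent of the others
        left, right = (i - 1) % n, (i + 1) % n
        new_h[i] = 0.5 * (h[left] + h[right]) - c * (f[right][0] - f[left][0])
        new_hu[i] = 0.5 * (hu[left] + hu[right]) - c * (f[right][1] - f[left][1])
    return new_h, new_hu
```

The time step `dt` must satisfy the usual CFL restriction for explicit schemes, roughly `dt * (|u| + sqrt(G * h)) / dx <= 1` in every cell. With periodic boundaries the scheme conserves the total of `h` up to rounding error.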

Articles published on IBM Cell Processor

Minimizing write operation for multi-dimensional DSP applications via a two-level partition technique with complete memory latency hiding

Parallel Implementation of the Wideband DOA Algorithm on Single Core, Multicore, GPU and IBM Cell BE Processor

Developing Systems for Real-Time Streaming Analysis

Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture

Multicore acceleration of Discrete Event System Specification systems

New Development of Parallel Conformal FDTD Method in Computational Electromagnetics Engineering

Performance analysis of multi‐level parallelism: inter‐node, intra‐node and hardware accelerators

Swan: A tool for porting CUDA programs to OpenCL

Hera-JVM

Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

A high-throughput screening approach to discovering good forms of biologically inspired visual representation.

The impact of IBM Cell technology on the programming paradigm in the context of computer systems for climate and weather models

Brain Derived Vision Algorithm on High Performance Architectures

Moving Scientific Codes to Multicore Microprocessor CPUs

The impact of accelerator processors for high-throughput molecular modeling and simulation

Development and use of pediatric frozen red cell packs.

Use and analysis of saline washed red blood cells.
