Cell BE Processor Research Articles

Microwave tomography (MT) is a safe screening modality that can be used for breast cancer detection. The technique uses the dielectric property contrasts between different breast tissues at microwave frequencies to determine the existence of abnormalities. Our proposed MT approach is an iterative process that involves two algorithms: Finite-Difference Time-Domain (FDTD) and Genetic Algorithm (GA). It is a compute-intensive problem: (i) the number of iterations needed to detect small tumors can be quite large; (ii) many fine-grained computations and discretizations of the object under screening are required for accuracy. In our earlier work, we developed a parallel algorithm for microwave tomography on CPU-based homogeneous, multi-core, distributed-memory machines. The performance improvement was limited by communication and synchronization latencies inherent in the algorithm. In this paper, we exploit the parallelism of microwave tomography on the Cell BE processor. Since FDTD is a numerical technique with regular memory accesses, intensive floating-point operations and SIMD-type operations, the algorithm can be efficiently mapped onto the Cell processor, achieving significant performance. The initial implementation of FDTD on Cell BE with 8 SPEs is 2.9 times faster than an eight-node shared-memory machine and 1.45 times faster than an eight-node distributed-memory machine. In this work, we modify the FDTD algorithm by overlapping computations with communications during asynchronous DMA transfers. The modified algorithm also orchestrates the computations to fully use data between DMA transfers, which increases the computation-to-communication ratio. We see a 54% improvement on 8 SPEs (27.9% on 1 SPE) for the modified FDTD in comparison to our original FDTD algorithm on Cell BE. We further reduce the synchronization latency between GA and FDTD by using mechanisms such as double buffering.
We also propose a performance prediction model based on DMA transfers, the number of instructions and operations, the processor frequency and the DMA bandwidth. We show that the execution time given by our prediction model is within 0.5 s of the execution time measured experimentally on one SPE.
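
The overlap of computation with asynchronous transfers mentioned in the abstract can be illustrated generically. The sketch below is not the authors' Cell BE code and does not use any Cell SDK call: a one-worker thread pool stands in for the DMA engine, and the names `fetch`, `compute` and `double_buffered_sum` are hypothetical. It only shows the double-buffering pattern, in which the transfer of the next chunk is started before the current chunk is processed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(data, start, size):
    """Stand-in for an asynchronous DMA transfer (assumption): copy one chunk."""
    return data[start:start + size]


def compute(chunk):
    """Stand-in for the per-chunk kernel (assumption): a sum of squares."""
    return sum(v * v for v in chunk)


def double_buffered_sum(data, chunk_size):
    """Process data chunk by chunk while the next chunk is being fetched."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # transfer of the chunk that will be processed next
        for start in range(0, len(data), chunk_size):
            # Start the transfer of the following chunk first ...
            upcoming = pool.submit(fetch, data, start, chunk_size)
            # ... then compute on the chunk that was requested earlier.
            if pending is not None:
                total += compute(pending.result())
            pending = upcoming
        if pending is not None:  # drain the last buffer
            total += compute(pending.result())
    return total
```

Calling `double_buffered_sum(values, 2)` on a list of floats returns the same sum of squares as a plain loop; only the scheduling of transfers and computation differs.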


On modern architectures, the performance of 32-bit operations is often at least twice that of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.

Program summary

Program title: ITER-REF
Catalogue identifier: AECO_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 7211
No. of bytes in distributed program, including test data, etc.: 41 862
Distribution format: tar.gz
Programming language: FORTRAN 77
Computer: desktop, server
Operating system: Unix/Linux
RAM: 512 Mbytes
Classification: 4.8
External routines: BLAS (optional)

Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice that of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution.

Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is generally used to improve numerical stability, resulting in a factorization PA = LU, where P is a permutation matrix. The solution of the system is obtained by first solving Ly = Pb (forward substitution) and then solving Ux = y (backward substitution). Due to round-off errors, the computed solution, x, carries a numerical error magnified by the condition number of the coefficient matrix A. To improve the computed solution, an iterative process can be applied that produces a correction to the computed solution at each iteration; this yields the method commonly known as the iterative refinement algorithm. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision.

Running time: seconds/minutes
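
The solution method described above can be sketched in a few lines. This is a minimal illustration, not the ITER-REF FORTRAN 77 code: the function names (`f32`, `lu_factor_f32`, `lu_solve_f32`, `refine`) are hypothetical, and 32-bit arithmetic is simulated in pure Python by rounding every intermediate result to single precision, which is an assumption made only to keep the example self-contained. The factorization and triangular solves run in (simulated) 32-bit precision, while the residual and the solution update are kept in 64-bit precision.

```python
import struct


def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """PA = LU with partial row pivoting, each operation rounded to 32 bits.

    Returns (lu, perm): lu stores U on and above the diagonal and the
    multipliers of unit-diagonal L below it; perm[i] is the original row
    now sitting at position i.
    """
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Solve Ly = Pb, then Ux = y, in simulated 32-bit arithmetic."""
    n = len(b)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):  # forward substitution
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # backward substitution
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def refine(a, b, max_iter=20, tol=1e-14):
    """Mixed precision iterative refinement for Ax = b (a sketch)."""
    lu, perm = lu_factor_f32(a)      # expensive step, 32-bit
    x = lu_solve_f32(lu, perm, b)    # initial 32-bit solution
    for _ in range(max_iter):
        # Residual r = b - Ax in 64-bit arithmetic.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve_f32(lu, perm, r)           # correction, 32-bit
        x = [xi + di for xi, di in zip(x, d)]   # update, 64-bit
    return x
```

For a small, well-conditioned system such as `refine([[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]], [12.0, 20.0, 26.0])`, the result should agree with the exact solution (1, 2, 3) far more closely than a single 32-bit solve would. The design point is that the factorization, which dominates the cost, is done once in the faster precision, and each refinement step reuses it.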


Articles published on Cell BE Processor

Efficient sorting design on a novel embedded parallel computing architecture with unique memory access

Cell-Dock: high-performance protein–protein docking

The Implementation and Optimization of Irregular Application Task Models Based on the Cell BE Processor

A software pipelining algorithm of streaming applications with low buffer requirements

IPM based sparse LP solver on a heterogeneous processor

Microwave tomography for breast cancer detection on Cell broadband engine processors

Parallel Rendering and Animation of Subdivision Surfaces on the Cell BE Processor

Implementation of a linear programming solver on the Cell BE processor

Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture

Optimal resource allocation and scheduling for the CELL BE platform

Parallel exact inference on the Cell Broadband Engine processor

Automatic parallelization of simulation code for equation-based models with software pipelining and measurements on three platforms

Optimized on-chip pipelining of memory-intensive computations on the cell BE

Accelerating scientific computations with mixed precision algorithms

Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy

Scalable Programming Models for Massively Multicore Processors

Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
