Irregular Memory Access Patterns Research Articles

The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific and general-purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E.-optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations.
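To make the list ranking kernel concrete, here is a minimal Python sketch. It is not the SM-Threads implementation described in the abstract; it only illustrates the problem itself, using a plain sequential walk and a textbook pointer-jumping variant. The function names and the array-of-successors layout are assumptions chosen for illustration.

```python
import random

def make_random_list(n, seed=0):
    """Build a linked list over nodes 0..n-1 stored in random order.

    Returns (succ, head), where succ[i] is the node after i, or -1 at the tail.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    succ = [-1] * n
    for a, b in zip(order, order[1:]):
        succ[a] = b
    return succ, order[0]

def rank_sequential(succ, head):
    """Walk the list once; rank[i] is the distance of node i from the head."""
    rank = [0] * len(succ)
    node, r = head, 0
    while node != -1:
        rank[node] = r          # each step jumps to an unpredictable index
        r += 1
        node = succ[node]
    return rank

def rank_pointer_jumping(succ):
    """Textbook pointer jumping: every node repeatedly doubles its reach.

    Each round is independent per node, so it could run in parallel;
    here the rounds are simulated sequentially.
    """
    n = len(succ)
    nxt = list(succ)
    dist = [0 if s == -1 else 1 for s in succ]   # hops covered so far
    while any(s != -1 for s in nxt):
        new_dist, new_nxt = list(dist), list(nxt)
        for i in range(n):
            j = nxt[i]
            if j != -1:
                new_dist[i] = dist[i] + dist[j]
                new_nxt[i] = nxt[j]
        dist, nxt = new_dist, new_nxt
    # dist[i] is now the distance to the tail; convert to distance from head.
    return [(n - 1) - d for d in dist]
```

In both versions almost all the work is following `succ` to a data-dependent index, with one addition per load. That is the low computational intensity and irregular access the abstract refers to.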


Sparse matrix–vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The SPARSITY system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. SPARSITY combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block size, using a combination of performance models and search. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. Cache level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 4× for the single vector case and up to 10× for the multiple vector case.
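The two operations discussed in the abstract can be sketched in plain Python over a compressed sparse row (CSR) layout. This is an untuned illustration, not code produced by SPARSITY; the function names and the choice of CSR are assumptions made for the example.

```python
def dense_to_csr(a):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv(values, col_idx, row_ptr, x):
    """Sparse matrix times one dense vector."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: an indirect, data-dependent access
            acc += values[p] * x[col_idx[p]]
        y[i] = acc
    return y

def spmv_multi(values, col_idx, row_ptr, xs):
    """Sparse matrix times a set of dense vectors, done as one operation.

    Each nonzero and its column index are read once and reused for
    every vector in xs, instead of being re-read per vector.
    """
    n_rows = len(row_ptr) - 1
    ys = [[0.0] * n_rows for _ in xs]
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            v, j = values[p], col_idx[p]
            for t, x in enumerate(xs):
                ys[t][i] += v * x[j]
    return ys
```

For example, with `A = [[4, 0, 1], [0, 0, 2], [3, 5, 0]]` and `x = [1, 2, 3]`, `spmv(*dense_to_csr(A), x)` returns `[7.0, 6.0, 13.0]`. The multiple-vector version computes the same products but amortizes each matrix load over all vectors, which is the reorganization the abstract credits with the larger speedups.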


Articles published on Irregular Memory Access Patterns

Data transformations enabling loop vectorization on multithreaded data parallel architectures

Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach

Adaptive Scratch Pad Memory Management for Dynamic Behavior of Multimedia Applications

Compiler driven data layout optimization for regular/irregular array access patterns

High performance combinatorial algorithm design on the Cell Broadband Engine processor

Exploiting locality for irregular scientific codes

CacheFlow: Cache Optimizations for Data Driven Multithreading

Tolerating memory latency through push prefetching for pointer-intensive applications

Sparsity: Optimization Framework for Sparse Matrix Kernels

Combining compile-time and run-time support for efficient software distributed shared memory

An Unsymmetric-Pattern Multifrontal Method for Sparse LU Factorization