Regular Access Patterns Research Articles

GPUs provide high-bandwidth, low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests. Specifically, concurrent memory requests that access contiguous memory are coalesced into warp-wide accesses. To support such large accesses to the L1 cache with low latency, the L1 cache line is no smaller than a warp-wide access. However, such an L1 cache architecture cannot always be utilized efficiently when applications generate many memory requests with irregular access patterns, especially because branch and memory divergence make requests uncoalesced and small. Furthermore, unlike the L1 cache, the shared memory of GPUs is often left unused in many applications, since its use depends on the programmer. In this article, we propose Elastic-Cache, which efficiently supports both fine- and coarse-grained L1 cache line management for applications with regular and irregular memory access patterns to improve L1 cache efficiency. Specifically, it can store 32- or 64-byte words from non-contiguous memory locations in a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage, since it stores the auxiliary tags for fine-grained L1 cache line management in the shared memory space that is not fully used in many applications. To improve the bandwidth utilization of the L1 cache with Elastic-Cache for fine-grained accesses, we further propose Elastic-Plus, which issues 32-byte memory requests in parallel, reducing the processing latency of memory instructions and improving GPU throughput. Our experimental results show that Elastic-Cache improves the geometric-mean performance of applications with irregular memory access patterns by 104% without degrading the performance of applications with regular memory access patterns. Elastic-Plus outperforms Elastic-Cache and improves the performance of applications with irregular memory access patterns by 131%.
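
The gap between coarse- and fine-grained line management described in this abstract can be illustrated with a small counting model. The sketch below is not the authors' simulator or the Elastic-Cache design. It only counts how many 128-byte lines and 32-byte chunks one warp touches under a regular (unit-stride) and an irregular (large-stride) access pattern. The warp size of 32 threads, the 4-byte element size, and all function names are assumptions made for illustration. Only the 128-byte and 32-byte sizes come from the abstract.

```python
WARP_SIZE = 32     # assumption: threads per warp
ELEM_BYTES = 4     # assumption: each thread loads one 4-byte element
LINE_BYTES = 128   # cache line size named in the abstract
CHUNK_BYTES = 32   # fine-grained request size named in the abstract


def warp_addresses(base, stride_elems):
    """Byte address loaded by each thread of one warp (hypothetical helper)."""
    return [base + t * stride_elems * ELEM_BYTES for t in range(WARP_SIZE)]


def blocks_touched(addresses, block_bytes):
    """Number of distinct aligned blocks of block_bytes that the addresses fall in."""
    return len({addr // block_bytes for addr in addresses})


def fetch_cost(addresses):
    """Bytes actually needed versus bytes moved at each fetch granularity."""
    lines = blocks_touched(addresses, LINE_BYTES)
    chunks = blocks_touched(addresses, CHUNK_BYTES)
    return {
        "useful_bytes": len(set(addresses)) * ELEM_BYTES,
        "bytes_at_line_granularity": lines * LINE_BYTES,
        "bytes_at_chunk_granularity": chunks * CHUNK_BYTES,
    }


if __name__ == "__main__":
    # Regular pattern: consecutive threads read consecutive elements.
    print("unit stride:", fetch_cost(warp_addresses(0, 1)))
    # Irregular-style pattern: every thread lands in a different 128-byte line.
    print("stride 32:  ", fetch_cost(warp_addresses(0, 32)))
```

In this model the unit-stride warp needs exactly one 128-byte line, while the strided warp needs 128 useful bytes but moves 4096 bytes at line granularity and 1024 bytes at 32-byte granularity. This is the kind of waste that fine-grained management aims to reduce.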

This paper proposes a high-throughput, memory-efficient arithmetic coder architecture for set partitioning in hierarchical trees (SPIHT) image compression, based on a simple context model. The architecture benefits from optimizations at several levels of arithmetic coding, from the algorithm down to the circuit implementation. First, the complex context model used in software is replaced by a simple context model that uses only the states of the brother nodes in the SPIHT coding zerotree to form context symbols for the arithmetic coding. The simple context model yields a regular access pattern when reading the wavelet transform coefficients, which is convenient for hardware implementation, at the cost of a slight performance loss. Second, to avoid rescanning the wavelet transform coefficients, a breadth-first-search SPIHT-without-lists algorithm is used instead of the SPIHT-with-lists algorithm. In particular, the coding bit-planes of each zerotree are processed in parallel. Third, an out-of-order execution mechanism for different types of context is proposed, which can allocate context symbols to an idle arithmetic coding core in an order different from that of the input. To balance the input rate of the wavelet coefficients, eight arithmetic coders are replicated in the compression system, and each arithmetic coder contains four cores that process different contexts. Fourth, several dedicated circuits are designed to further improve the throughput of the architecture. The common bit detection (CBD) circuit is used to unroll the renormalization stage of the arithmetic coding. A carry look-ahead adder (CLA) and a fast multiplier-divider are also employed to shorten the critical path of the architecture. Moreover, an adaptive clock switch mechanism can stop the clocks of invalid bit-planes to save power, depending on the input image. Experimental results demonstrate that the proposed architecture attains a maximum throughput of 902.464 Mb/s and saves 20.08% in power consumption compared with a full bit-plane coding scheme, based on field-programmable gate arrays (FPGAs).
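
The abstract does not describe the CBD circuit in detail. One plausible reading of "unrolling the renormalization stage" is that the number of leading bits shared by the interval bounds is detected in a single step, so that all of them can be shifted out at once instead of one per iteration. The sketch below illustrates that idea only. It is an assumption, not the paper's circuit: the 16-bit register width and the names `low`, `high`, `common_leading_bits`, `renormalize_serial` and `renormalize_unrolled` are hypothetical, and the underflow (near-convergence) case of a full arithmetic coder is deliberately left out.

```python
WIDTH = 16                   # assumption: width of the interval registers
MASK = (1 << WIDTH) - 1


def common_leading_bits(low, high):
    """Count the most-significant bits on which low and high agree."""
    diff = (low ^ high) & MASK
    return WIDTH - diff.bit_length()


def renormalize_serial(low, high):
    """Textbook loop: emit one bit per iteration while the top bits match."""
    bits = []
    top = 1 << (WIDTH - 1)
    while (low & top) == (high & top):
        bits.append(1 if low & top else 0)
        low = (low << 1) & MASK          # shift a 0 into low
        high = ((high << 1) & MASK) | 1  # shift a 1 into high
    return bits, low, high


def renormalize_unrolled(low, high):
    """Same result in one step: detect the shared prefix, then shift it out."""
    n = common_leading_bits(low, high)
    bits = [(low >> (WIDTH - 1 - i)) & 1 for i in range(n)]
    low = (low << n) & MASK
    high = ((high << n) & MASK) | ((1 << n) - 1)
    return bits, low, high


if __name__ == "__main__":
    print(renormalize_unrolled(0xA000, 0xBFFF))
```

Both functions return the emitted bits together with the updated bounds. The unrolled version replaces a data-dependent loop with a single prefix detection and one shift, which is the property a hardware implementation would exploit to raise throughput.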

Articles published on Regular Access Patterns

Multiply-and-Fire: An Event-Driven Sparse Neural Network Accelerator

CompAct

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

Embedded DRAM-Based Memory Customization for Low-Cost FFT Processor Design

Pragma Directed Shared Memory Centric Optimizations on GPUs

Compiler-Driven Software Speculation for Thread-Level Parallelism

Asynchronous memory access chaining

Static analysis of the worst-case memory performance for irregular codes with indirections

VLSI Architecture of Arithmetic Coder Used in SPIHT

A Compile/Run-time Environment for the Automatic Transformation of Linked List Data Structures

Compiler driven data layout optimization for regular/irregular array access patterns

Automated and accurate cache behavior analysis for codes with irregular access patterns

Analytical modeling of codes with arbitrary data-dependent conditional structures

MEMORY HIERARCHY PERFORMANCE PREDICTION FOR BLOCKED SPARSE ALGORITHMS

Combining compile-time and run-time support for efficient software distributed shared memory

Modeling set associative caches behavior for irregular computations

Visualizing working sets
