GPGPU-accelerated computing has revolutionized a broad range of applications. To bridge the gap between the ever-growing computing capability and the external memory, on-chip memory is becoming increasingly important to GPGPU performance for general-purpose computing. Inherited from traditional CPUs, however, the contemporary GPGPU on-chip memory design is suboptimal for SIMT (single instruction, multiple threads) execution. In particular, thrashing of the on-chip first-level data (L1D) cache, caused by insufficient capacity and imbalanced set usage, leads to a low hit rate and limits overall performance. In this study, we reform the contemporary on-chip memory design and propose an integrated and balanced on-chip memory (IBOM) architecture for high-performance GPGPUs. IBOM first virtually enlarges the L1D cache through an integrated architecture that exploits the under-utilized register file (RF), with lightweight ISA, compiler, and microarchitecture support. With the enlarged capacity, it then improves cache utilization through a set-balancing technique that exploits under-utilized cache sets. In the proposed IBOM design, register and cache accesses remain compatible with normal pipeline operation with only simple changes. IBOM exploits the size inversion in GPGPU on-chip memory, where the RF capacity exceeds that of the L1D cache, and makes better use of these precious resources for higher performance and energy efficiency even with a smaller total on-chip memory size. Experimental results demonstrate that the proposed IBOM design increases the L1D hit rate by 29.6 percent on average and in turn improves performance by 3X for cache-sensitive applications.
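To make the set-balancing idea concrete, the toy C++ model below redirects fills from a saturated home set to a less-used partner set, so that addresses that would otherwise thrash a single set spread across two. The cache geometry, the XOR partner mapping, the pressure counter, and the replacement policy are all illustrative assumptions, not the paper's actual IBOM mechanism.

```cpp
// Toy set-associative cache illustrating set balancing: on a miss, the fill
// goes to whichever of the home set or its partner set is under less pressure.
// All parameters and policies here are assumptions made for illustration.
#include <array>
#include <cstdint>
#include <initializer_list>
#include <iostream>

constexpr int kSets = 32;   // assumed number of L1D sets
constexpr int kWays = 4;    // assumed associativity

struct Set {
    std::array<uint64_t, kWays> tags{};
    std::array<bool, kWays> valid{};
    int fills = 0;          // crude per-set pressure counter
};

std::array<Set, kSets> cache;

// Assumed partner mapping: pair each set with one in the other half.
int partner(int set) { return set ^ (kSets / 2); }

bool lookup_or_fill(uint64_t addr) {
    uint64_t line = addr >> 7;                  // assumed 128-byte lines
    int home = static_cast<int>(line % kSets);
    uint64_t tag = line / kSets;

    // Probe the home set first, then its partner.
    for (int s : {home, partner(home)}) {
        for (int w = 0; w < kWays; ++w)
            if (cache[s].valid[w] && cache[s].tags[w] == tag) return true;  // hit
    }
    // Miss: fill into the partner set when the home set is under more pressure.
    int target = (cache[home].fills > cache[partner(home)].fills)
                     ? partner(home) : home;
    Set& s = cache[target];
    int way = s.fills % kWays;                  // trivial replacement policy
    s.tags[way] = tag;
    s.valid[way] = true;
    s.fills++;
    return false;                               // miss
}

int main() {
    // Eight lines that all map to the same home set: a plain 4-way set would
    // thrash, but balancing spreads them over the home and partner sets,
    // so the second pass hits entirely.
    int hits = 0;
    for (int rep = 0; rep < 2; ++rep)
        for (uint64_t i = 0; i < 8; ++i)
            hits += lookup_or_fill(i * kSets * 128);
    std::cout << "hits: " << hits << "\n";      // prints 8
    return 0;
}
```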