Shared Cache Research Articles

The performance of a task running on a many-core with a distributed shared last-level cache (LLC) depends strongly on two parameters: the power budget needed to guarantee thermally safe operation and the LLC latency. The task's thread-to-core mapping determines both parameters and must trade one against the other, because both cannot be optimal simultaneously. The arrival and departure of tasks on a many-core deployed in an open system can significantly change its state in terms of available cores and power budgets. Task migrations can therefore be used to keep the many-core operating at peak performance. Furthermore, the relative impact of power budget and LLC latency on a task's performance may change across its execution phases, which calls for migrating the task on the fly. We propose PCMig, the first run-time algorithm that increases the performance of a many-core with distributed shared LLC by migrating tasks based on their phases and the many-core's state. PCMig relies on a model that predicts the performance impact of migrations. We propose a performance prediction model based on a lightweight neural network (NN). To serve as a reference, we also propose an analytical model of the many-core that operates on CPI stacks. We demonstrate that the NN-based model achieves higher prediction accuracy at lower overhead than the analytical model. PCMig uses the NN prediction model and achieves a performance increase of up to 7.3 percent under a thermal constraint for mixed workloads compared to the architecture-aware state of the art (up to 20 percent for individual applications). This is achieved with a run-time overhead of less than 0.5 percent.
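The trade-off described in this abstract can be illustrated with a toy calculation. The sketch below is not the paper's analytical model or its NN; it assumes a hypothetical two-component CPI stack (a base component and an LLC stall component), assumes the LLC component scales linearly with LLC latency, and uses made-up candidate mappings "A" and "B" with illustrative frequencies standing in for power budgets. All names and numbers are assumptions chosen only to show why the preferred mapping can change with the execution phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpiStack:
    """Hypothetical two-component CPI stack for one execution phase."""
    base: float  # cycles per instruction not spent waiting on the LLC
    llc: float   # cycles per instruction stalled on the LLC at reference latency


@dataclass(frozen=True)
class Mapping:
    """Hypothetical candidate mapping; both numeric fields are illustrative."""
    name: str
    freq_ghz: float           # frequency assumed sustainable under the power budget
    llc_latency_scale: float  # LLC latency relative to the reference mapping


def predicted_ips(stack: CpiStack, mapping: Mapping) -> float:
    """Predict instructions per second, assuming LLC stalls scale with latency."""
    cpi = stack.base + stack.llc * mapping.llc_latency_scale
    return mapping.freq_ghz * 1e9 / cpi


def best_mapping(stack: CpiStack, candidates: list[Mapping]) -> Mapping:
    """Pick the candidate with the highest predicted throughput."""
    return max(candidates, key=lambda m: predicted_ips(stack, m))


# Mapping A: lower LLC latency but a tighter power budget (lower frequency).
# Mapping B: higher frequency but 1.5x the LLC latency.
candidates = [
    Mapping("A", freq_ghz=2.0, llc_latency_scale=1.0),
    Mapping("B", freq_ghz=2.4, llc_latency_scale=1.5),
]
compute_phase = CpiStack(base=1.0, llc=0.1)  # mostly compute-bound
memory_phase = CpiStack(base=0.5, llc=1.5)   # mostly LLC-bound

print(best_mapping(compute_phase, candidates).name)  # B
print(best_mapping(memory_phase, candidates).name)   # A
```

Under these assumed numbers the compute-bound phase prefers the higher-frequency mapping and the LLC-bound phase prefers the lower-latency one, which is the kind of phase-dependent decision that would motivate a migration.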

GPUs provide high-bandwidth, low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests. Specifically, concurrent memory requests that access contiguous memory space are coalesced into warp-wide accesses. To support such large accesses to the L1 cache with low latency, the L1 cache line is no smaller than a warp-wide access. However, this L1 cache architecture cannot always be utilized efficiently when applications generate many memory requests with irregular access patterns, especially when branch and memory divergence make requests uncoalesced and small. Furthermore, unlike the L1 cache, the shared memory of GPUs is left unused by many applications, since its use depends on the programmer. In this article, we propose Elastic-Cache, which efficiently supports both fine- and coarse-grained L1 cache line management for applications with regular and irregular memory access patterns to improve L1 cache efficiency. Specifically, it can store 32- or 64-byte words from non-contiguous memory space in a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage, since it stores the auxiliary tags for fine-grained L1 cache line management in the shared memory space that many applications do not fully use. To improve the bandwidth utilization of the L1 cache with Elastic-Cache for fine-grained accesses, we further propose Elastic-Plus, which issues 32-byte memory requests in parallel, reducing the processing latency of memory instructions and improving GPU throughput. Our experimental results show that Elastic-Cache improves the geometric-mean performance of applications with irregular memory access patterns by 104% without degrading the performance of applications with regular memory access patterns. Elastic-Plus outperforms Elastic-Cache and improves the performance of applications with irregular memory access patterns by 131%.
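The inefficiency that motivates fine-grained line management can be seen with simple arithmetic. The sketch below is not Elastic-Cache itself; it only counts how many bytes a warp pulls into the cache at 128-byte versus 32-byte granularity. The warp size of 32 threads, the 4-byte access per thread, and the two example address patterns are assumptions made for illustration, and every access is assumed to be aligned so that it never straddles a block.

```python
WARP_SIZE = 32   # threads per warp (assumed)
WORD_BYTES = 4   # bytes accessed per thread (assumed)


def bytes_fetched(addresses: list[int], granularity: int) -> int:
    """Bytes brought in when every touched block of `granularity` bytes is fetched whole."""
    blocks = {addr // granularity for addr in addresses}
    return len(blocks) * granularity


def utilization(addresses: list[int], granularity: int) -> float:
    """Fraction of fetched bytes that the warp actually uses."""
    useful = len(set(addresses)) * WORD_BYTES
    return useful / bytes_fetched(addresses, granularity)


# Regular pattern: consecutive 4-byte words, coalesced into one 128-byte access.
coalesced = [tid * WORD_BYTES for tid in range(WARP_SIZE)]
# Irregular pattern: every thread lands in a different 128-byte line.
scattered = [tid * 128 for tid in range(WARP_SIZE)]

for name, pattern in (("coalesced", coalesced), ("scattered", scattered)):
    for granularity in (128, 32):
        print(name, granularity,
              bytes_fetched(pattern, granularity),
              utilization(pattern, granularity))
```

For the coalesced pattern both granularities fetch exactly the 128 useful bytes. For the scattered pattern, 128-byte lines fetch 4096 bytes to deliver 128 useful ones, while 32-byte blocks fetch 1024, which is why packing several small non-contiguous words into one line can help irregular workloads.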

Articles published on Shared Cache

Neural Network-based Performance Prediction for Task Migration on S-NUCA Many-Cores

Linear Network Coded Wireless Caching in Cloud Radio Access Network

Fast distributed compilation and testing of large C++ projects

Fundamental Limits of Coded Caching With Multiple Antennas, Shared Caches and Uncoded Prefetching

REAL

Cooperative Inter-Domain Cache Sharing for Information-Centric Networking via a Bargaining Game Approach

Enhanced Methods of Mobile Cache Sharing and Pre-fetching for Required Web Contents

Modeling and Analysis of a Shared Edge Caching System for Connected Cars and Industrial IoT-Based Applications

A Skewed Multi-banked Cache for Many-core Vector Processors

Efficient Data Transfer in a Heterogeneous Multicore-Based CE Systems using Cache Performance Optimization

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

Time-sensitivity-aware shared cache architecture for multi-core embedded systems

FOS: a low-power cache organization for multicores

SCaN-Mob: An opportunistic caching strategy to support producer mobility in named data wireless networking

Efficient File Delivery for Coded Prefetching in Shared Cache Networks With Multiple Requests Per User

Efficient Algorithms for Coded Multicasting in Heterogeneous Caching Networks.

Dynamic directory table with victim cache: on-demand allocation of directory entries for active shared cache blocks

A Gaussian Set Sampling Model for Efficient Shared Cache Profiling on Multi-Cores

A Low-power Shared Cache Design with Modified PID Controller for Efficient Multicore Embedded Systems

Filter router: An enhanced router design for efficient stacked shared cache network
