Memory Access Latency Research Articles

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts, and the emergence of increasingly more data-intensive and latency-critical applications further stress the importance of providing low-latency memory accesses. In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-o to improve energy efficiency. First, while bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow o -chip channel. The use of this narrow channel for bulk data movement results in high latency and high energy consumption. This dissertation introduces a new DRAM design, Low-cost Inter-linked SubArrays (LISA), which provides fast and energy-efficient bulk data movement across sub- arrays in a DRAM chip. We show that the LISA substrate is very powerful and versatile by demonstrating that it efficiently enables several new architectural mechanisms, including low-latency data copying, reduced DRAM access latency for frequently-accessed data, and reduced preparation latency for subsequent accesses to a DRAM bank. Second, DRAM needs to be periodically refreshed to prevent data loss due to leakage. Unfortunately, while DRAM is being refreshed, a part of it becomes unavailable to serve memory requests, which degrades system performance. To address this refresh interference problem, we propose two access-refresh parallelization techniques that enable more overlap- ping of accesses with refreshes inside DRAM, at the cost of very modest changes to the memory controllers and DRAM chips. These two techniques together achieve performance close to an idealized system that does not require refresh. Third, we find, for the first time, that there is significant latency variation in accessing different cells of a single DRAM chip due to the irregularity in the DRAM manufacturing process. As a result, some DRAM cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation and use a fixed latency value based on the slowest cell across all DRAM chips. To exploit latency variation within the DRAM chip, we experimentally characterize and understand the behavior of the variation that exists in real commodity DRAM chips. Based on our characterization, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism to reduce DRAM latency by categorizing the DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency, thereby improving system performance significantly. Our extensive experimental characterization and analysis of latency variation in DRAM chips can also enable development of other new techniques to improve performance or reliability. Fourth, this dissertation, for the first time, develops an understanding of the latency behavior due to another important factor { supply voltage, which significantly impacts DRAM performance, energy consumption, and reliability. We take an experimental approach to understanding and exploiting the behavior of modern DRAM chips under different supply voltage values. Our detailed characterization of real commodity DRAM chips demonstrates that memory access latency reduces with increasing supply voltage. Based on our characterization, we propose Voltron, a new mechanism that improves system energy efficiency by dynamically adjusting the DRAM supply voltage based on a performance model. Our extensive experimental data on the relationship between DRAM supply voltage, latency, and reliability can further enable developments of other new mechanisms that improve latency, energy efficiency, or reliability. The key conclusion of this dissertation is that augmenting DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips together leads to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and detailed experimental data on real commodity DRAM chips presented in this dissertation will enable developments of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

Read full abstract

Purpose: (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore the software micro-optimization methods. Methods: The patient-specific source of Tomotherapy and Varian TrueBeam Linacs was modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014 41(7) Medical Physics). For the single-view Varian TrueBeam case, we analytically derived them from the raw patient-independent PS data in IAEA's database, partial geometry information of the jaw and MLC as well as the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed by ARCHER-MIC and GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was systematically conducted, and was focused on SIMD vectorization of tight for-loops and data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency. Results: Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. Using double-precision, the total wall time of the multithreaded CPU code on a X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam, while on 3 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation. Conclusion: We have extended ARCHER, the MIC and GPU-based Monte Carlo dose engine to Tomotherapy and Truebeam dose calculations.

Read full abstract

Memory Access Latency Research Articles

Related Topics

Articles published on Memory Access Latency

Energy-Aware Data Allocation With Hybrid Memory for Mobile Cloud Systems

Performance-Monitoring-Based Traffic-Aware Virtual Machine Deployment on NUMA Systems

Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

Understanding and Improving the Latency of DRAM-Based Memory Systems

FPGASW: Accelerating Large-Scale Smith-Waterman Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array.

Accelerated parametric chamfer alignment using a parallel, pipelined GPU realization

SALAD: Achieving Symmetric Access Latency with Asymmetric DRAM Architecture

LA-LLC: Inter-Core Locality-Aware Last-Level Cache to Exploit Many-to-Many Traffic in GPGPUs

Cooperative Caching for GPUs

SPMPool

Hybrid L2 NUCA Design and Management Considering Data Access Latency, Energy Efficiency, and Storage Lifetime

A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal System of Linear Equations

A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems

Optimization of atmospheric transport models on HPC platforms

Studying Inter-Warp Divergence Aware Execution on GPUs

Accelerating dependent cache misses with an enhanced memory controller

TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators

A comprehensive reconfigurable computing approach to memory wall problem of large graph computation

A composable worst case latency analysis for multi-rank DRAM devices under open row policy

High-level Video Analytics PC Subsystem Using SoC With Heterogeneous Multicore Architecture

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Memory Access Latency Research Articles

Related Topics

Articles published on Memory Access Latency

Energy-Aware Data Allocation With Hybrid Memory for Mobile Cloud Systems

Performance-Monitoring-Based Traffic-Aware Virtual Machine Deployment on NUMA Systems

Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

Understanding and Improving the Latency of DRAM-Based Memory Systems

FPGASW: Accelerating Large-Scale Smith-Waterman Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array.

Accelerated parametric chamfer alignment using a parallel, pipelined GPU realization

SALAD: Achieving Symmetric Access Latency with Asymmetric DRAM Architecture

LA-LLC: Inter-Core Locality-Aware Last-Level Cache to Exploit Many-to-Many Traffic in GPGPUs

Cooperative Caching for GPUs

SPMPool

Hybrid L2 NUCA Design and Management Considering Data Access Latency, Energy Efficiency, and Storage Lifetime

A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal System of Linear Equations

A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems

Optimization of atmospheric transport models on HPC platforms

Studying Inter-Warp Divergence Aware Execution on GPUs

Accelerating dependent cache misses with an enhanced memory controller

TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators

A comprehensive reconfigurable computing approach to memory wall problem of large graph computation

A composable worst case latency analysis for multi-rank DRAM devices under open row policy

High-level Video Analytics PC Subsystem Using SoC With Heterogeneous Multicore Architecture