Abstract

Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application‐specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can improve the performance of memory‐bound scientific applications. For instance, Impulse decreases the running time of the NAS conjugate gradient benchmark by 67%. We expect that Impulse will also benefit regularly strided, memory‐bound applications of commercial importance, such as database and multimedia programs.
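To make the remapping idea concrete, the sketch below shows how an application might alias the diagonal of a matrix into a dense region so that every fetched cache line holds only useful elements. The interface name impulse_remap_strided and its signature are illustrative assumptions; in the actual system the shadow region is configured through the operating system and the Impulse controller, not through a fixed user-level API.

    /* Hypothetical sketch of requesting an Impulse remapping.
     * impulse_remap_strided() is an assumed, illustrative interface:
     * it asks the OS/controller to expose every 'stride'-th element
     * (stride counted in elements) of 'base' as a dense alias. */
    #include <stddef.h>

    extern void *impulse_remap_strided(void *base, size_t elem_size,
                                       size_t stride, size_t count);

    /* A is an n x n row-major matrix of doubles. */
    double sum_diagonal(double *A, size_t n)
    {
        /* Alias the diagonal A[i][i] (elements n+1 apart) as a dense array.
         * Each cache line fetched through 'diag' then holds only diagonal
         * elements, instead of one useful element per line. */
        double *diag = impulse_remap_strided(A, sizeof(double), n + 1, n);

        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += diag[i];
        return sum;
    }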

Highlights

  • Since 1987, microprocessor performance has improved at a rate of 55% per year; in contrast, DRAM latencies have improved by only 7% per year, and DRAM bandwidths by only 15–20% per year [14]

  • We describe the internal architecture of the Impulse memory controller, and explain the kinds of address remappings that it currently supports

  • A Page Table Unit containing a simple ALU and a Memory Controller TLB (MTLB), which together map addresses in the dense shadow space first to pseudo-virtual addresses and then to physical addresses backed by DRAM, along with a small number of buffers that hold prefetched page table entries (see the sketch following this list)
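The following C fragment is a functional model, not a hardware description, of the two-step translation performed by the Page Table Unit: simple ALU arithmetic maps a dense shadow offset back to the pseudo-virtual address it aliases, and an MTLB lookup (declared but left abstract here) maps that pseudo-virtual address to a physical DRAM address. All structure and function names are assumptions made for illustration.

    #include <stdint.h>

    typedef struct {
        uint64_t shadow_base;   /* start of the dense shadow region            */
        uint64_t virt_base;     /* pseudo-virtual base of the original data    */
        uint64_t stride;        /* bytes between consecutive gathered elements */
        uint64_t elem_size;     /* size of each gathered element in bytes      */
    } shadow_desc_t;

    /* Assumed MTLB/page-table lookup: pseudo-virtual address -> physical. */
    extern uint64_t mtlb_lookup(uint64_t pseudo_virtual_addr);

    static uint64_t translate_shadow(const shadow_desc_t *d, uint64_t shadow_addr)
    {
        /* Step 1: ALU math maps a dense shadow offset back to the sparse
         * pseudo-virtual address of the element it aliases. */
        uint64_t offset = shadow_addr - d->shadow_base;
        uint64_t index  = offset / d->elem_size;
        uint64_t pseudo_virtual = d->virt_base + index * d->stride
                                + (offset % d->elem_size);

        /* Step 2: the MTLB maps pseudo-virtual to physical DRAM addresses. */
        return mtlb_lookup(pseudo_virtual);
    }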


Summary

Introduction

Since 1987, microprocessor performance has improved at a rate of 55% per year; in contrast, DRAM latencies have improved by only 7% per year, and DRAM bandwidths by only 15–20% per year [14]. Consider an application that traverses the diagonal of a dense matrix: on a conventional memory system, each time the processor accesses a new diagonal element (A[i][i]), it requests a full cache line of contiguous physical memory (typically 32–128 bytes on modern systems), of which only one element is useful. Impulse's remapping lets the controller gather such sparsely accessed data into dense cache lines, and prefetching at the memory controller helps hide the latency of Impulse's address translation; prefetching is also a useful optimization for non-remapped data. For remapped data, prefetching enables the controller to hide the costs associated with remapping, since some remappings can require multiple DRAM accesses to fill a single cache line. With both prefetching and remapping, an Impulse controller significantly outperforms conventional memory systems. Some of the optimizations that we describe are not conceptually new, but the Impulse project is the first system that provides hardware support for them in general-purpose computer systems. For both benchmarks, the use of Impulse optimizations significantly improves performance compared to a conventional memory controller.
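The small program below works through the bus-traffic arithmetic behind the diagonal example; the 128-byte cache line and 8-byte elements are assumed values for illustration, not figures from the paper's evaluation platform.

    #include <stdio.h>

    int main(void)
    {
        const unsigned n    = 1024;   /* diagonal elements traversed      */
        const unsigned line = 128;    /* assumed cache-line size in bytes */
        const unsigned elem = 8;      /* sizeof(double)                   */

        /* Conventional system: each diagonal element lands in a different
         * line, so every access fetches a full line to use one element. */
        unsigned long conv_bytes = (unsigned long)n * line;

        /* With remapping: the controller gathers diagonal elements into
         * dense lines, so only n * elem bytes cross the bus (rounded up
         * to whole lines). */
        unsigned long remap_bytes =
            (((unsigned long)n * elem + line - 1) / line) * line;

        printf("conventional: %lu bytes, remapped: %lu bytes (%.1fx less traffic)\n",
               conv_bytes, remap_bytes, (double)conv_bytes / remap_bytes);
        return 0;
    }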

Impulse architecture
Using Impulse
Hardware
Impulse optimizations
Sparse matrix-vector product
Tiled matrix algorithms
Performance
Dense matrix-matrix product
Impact of superscalar processors
Related work
Findings
Conclusions