Abstract

The computing domain of today's computer systems is rapidly shifting from arithmetic to data processing as data volumes grow exponentially. As a result, processing-in-memory (PIM) has been actively studied to support data processing in or near memory devices, addressing the limited bandwidth and high power consumption caused by data movement between the CPU/GPU and memory. However, most PIM studies so far have placed the processing units only as an accelerator on the base die of 3D-stacked DRAM rather than inside the memory itself, and they do not service standard DRAM requests during PIM execution. In this paper, we therefore show how to design and operate PIM computing units inside DRAM by effectively coordinating them with standard DRAM operations, while achieving full computing performance and minimizing implementation cost. To achieve these goals, we extend the standard DRAM state diagram so that PIM behaviors are scheduled and operated on DRAM devices in the same way as standard DRAM commands, and we exploit several levels of parallelism to overlap memory and computing operations. We also present how the entire stack of architecture layers, from applications to operating systems, memory controllers, and PIM devices, should work together for effective execution, applying our approaches to our experimental platform. On our HBM2-based experimental platform, which includes 16-cycle MAC (multiply-and-accumulate) units and 8-cycle reducers for matrix-vector multiplication, all-bank and per-bank scheduling achieve 406% and 35.2% faster performance, respectively, on a $(1024\times1024)\times(1024\times1)$ 8-bit integer matrix-vector multiplication than the time to burst-read its operands alone at the full external DRAM bandwidth. Note that the performance of a PIM on the base die of a 3D-stacked memory can never be better than what the full bandwidth provides.
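For concreteness, the baseline in the comparison above is the time to burst-read only the two operands at full external bandwidth. In symbols of our own (not the paper's), with $N = 1024$ and 1-byte elements:

$$
t_{\text{baseline}} \;=\; \frac{(N^2 + N)\,\text{bytes}}{BW_{\text{ext}}},
\qquad
\text{speedup} \;=\; \frac{t_{\text{baseline}}}{t_{\text{PIM}}} - 1,
$$

so, under one natural reading, the reported 406% corresponds to $t_{\text{baseline}}/t_{\text{PIM}} \approx 5.06$ for all-bank scheduling. Since a base-die PIM must move its operands across that same interface, $t_{\text{baseline}}$ is the bound the final sentence refers to.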

Highlights

  • The von Neumann architecture has been followed by most computers since it was first proposed [1]

  • We extend the standard DRAM state diagram so that PIM commands are expressed with standard DRAM commands

  • When a computing unit is designed to be placed within DRAM, many design issues arise, as discussed above

Summary

INTRODUCTION

The von Neumann architecture has been followed by most computers since it was first proposed [1]. Prior PIM studies all assumed that standard DRAM requests are not serviced during computation, even though handling a standard memory request during a PIM operation is essential for PIM to act as both a memory and an accelerator. They also did not present how the entire stack of architecture layers, from applications to operating systems, memory controllers, and PIM devices, should work together to achieve significant performance at minimal implementation cost. Our PIM architecture exploits several levels of parallelism in software and hardware, as shown in Figure 2 (assuming a 2-cycle MAC and a 1-cycle reducer): 1) multi-way vector operations by a computing unit per bank (data-level parallelism); 2) independent bank-level execution to use the full internal bandwidth for read and write operations (bank-level parallelism); 3) overlapping memory behaviors with computing ones (overlapping memory and compute operations); and 4) exploiting independent PIM operations informed by software (task-level parallelism). A minimal functional sketch of the bank-level partitioning follows.
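The sketch below is ours, not the paper's design: it models, on a CPU, how matrix rows striped across banks could each be processed by a per-bank MAC unit, with partial sums combined as a reducer would. All names (NUM_BANKS, mac_row, ...) and the bank count are illustrative assumptions.

```c
/* Hypothetical functional model of per-bank MAC execution for
   y = A * x with 8-bit integer operands (not the paper's RTL). */
#include <stdint.h>
#include <stdio.h>

#define N 1024               /* matrix dimension: 1024x1024 times 1024x1 */
#define NUM_BANKS 16         /* banks per channel: an assumption */
#define ROWS_PER_BANK (N / NUM_BANKS)

/* Per-bank MAC unit: multiply-accumulate one row slice against x. */
static int32_t mac_row(const int8_t *row, const int8_t *x, int len) {
    int32_t acc = 0;
    for (int i = 0; i < len; i++)
        acc += (int32_t)row[i] * (int32_t)x[i];   /* MAC step */
    return acc;                                   /* reduced partial sum */
}

int main(void) {
    static int8_t A[N][N], x[N];
    static int32_t y[N];
    for (int i = 0; i < N; i++) {                 /* trivial test data */
        x[i] = 1;
        for (int j = 0; j < N; j++) A[i][j] = 1;
    }
    /* All-bank scheduling, conceptually: every bank's MAC unit runs in
       parallel over its own row slice (serialized here on a CPU). */
    for (int b = 0; b < NUM_BANKS; b++)
        for (int r = b * ROWS_PER_BANK; r < (b + 1) * ROWS_PER_BANK; r++)
            y[r] = mac_row(A[r], x, N);
    printf("y[0] = %d\n", (int)y[0]);             /* expect 1024 */
    return 0;
}
```

In hardware, the outer bank loop collapses into concurrent per-bank execution (item 2 above), and each mac_row call would itself be a multi-way vector MAC (item 1), overlapped with row activations and bursts (item 3).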

DESIGN ISSUES AND OUR APPROACHES
EXPERIMENTAL DESIGN
PERFORMANCE EVALUATION
RELATED WORK
DISCUSSION AND FUTURE WORK
CONCLUSION