Abstract

Deep Neural Network (DNN) and Recurrent Neural Network (RNN) applications, which are rapidly gaining traction in the market, process large amounts of low-locality data; their peak performance is therefore limited by memory bandwidth. Many data centers have consequently adopted high-bandwidth memory such as HBM2/HBM2E to alleviate the problem. However, this approach is not a complete solution, since data must still be transferred from memory to the computing unit. Processing-in-memory (PIM), which performs computation inside the memory itself, has therefore attracted attention. Most previous PIM approaches, however, require modifying or extending core pipelines and memory-system components such as memory controllers, which makes practical implementation of PIM challenging and costly to develop. In this article, we propose Silent-PIM, which performs PIM computation using only standard DRAM memory requests; it therefore requires no hardware modifications and allows the PIM memory device to perform computation while servicing memory requests from non-PIM applications. We achieve this design goal by preserving standard memory-request behavior and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM offloading engine, so PIM memory requests are processed quickly while the core is free to perform other tasks. We compared the performance of three Long Short-Term Memory (LSTM) kernels on real platforms: the Silent-PIM modeled on an FPGA, a GPU, and a CPU. For (p ×512) ×(512 ×2048) matrix multiplication with the batch size p varying from 1 to 128, Silent-PIM was up to 16.9x and 24.6x faster than the GPU and CPU, respectively, at p=1, the case with no data reuse. At p=128, the case with the highest data reuse, the GPU performed best, but Silent-PIM still outperformed the CPU. Similarly, for (p ×2048) element-wise multiplication and addition, where there is no data reuse, Silent-PIM was always faster than both the CPU and GPU. Silent-PIM's energy-delay product (EDP) was also superior to the others in all cases with no data reuse.
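
To make the benchmarked workload shapes concrete, the sketch below reproduces them in NumPy. The array names and random data are illustrative assumptions, not the authors' code; it only shows why the matrix multiplication has weight reuse that grows with p, while the element-wise kernel never reuses data.

```python
import numpy as np

p = 1  # batch size; the paper sweeps p from 1 to 128
x = np.random.rand(p, 512).astype(np.float32)     # input activations (illustrative)
W = np.random.rand(512, 2048).astype(np.float32)  # weight matrix (illustrative)

# (p x 512) x (512 x 2048) matrix multiplication.
# At p = 1 each weight element is used only once (no data reuse), so the
# kernel is memory-bandwidth bound; at p = 128 every weight element is
# reused across 128 rows, which favors the GPU.
y = x @ W

# (p x 2048) element-wise multiplication and addition.
# Every operand is touched exactly once, so there is no data reuse at any
# batch size and the kernel stays bandwidth-bound.
a = np.random.rand(p, 2048).astype(np.float32)
b = np.random.rand(p, 2048).astype(np.float32)
c = np.random.rand(p, 2048).astype(np.float32)
z = a * b + c
```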
