Abstract

Advances in machine learning (ML) have ignited hardware innovations for efficient execution of ML models, many of which are memory-bound (e.g., long short-term memories, multi-layer perceptrons, and recurrent neural networks). Specifically, inference using these ML models with small batches, as would be the case at the Cloud edge, has little reuse of the large filters and is deeply memory-bound. Simultaneously, processing-in or -near memory (PIM or PNM) promises an unprecedented high-bandwidth connection between compute and memory. Fortunately, the memory-bound ML models are a good fit for PIM. We focus on digital PIM, which provides higher bandwidth than PNM and does not incur the reliability issues of analog PIM. Previous PIM and PNM approaches advocate full processor cores, which do not conform to PIM's severe area and power constraints. We describe Newton, a major DRAM maker's upcoming accelerator-in-memory (AiM) product for machine learning, which makes the following contributions: (1) To satisfy PIM's area constraints, Newton (a) places minimal compute, consisting only of multiply-accumulate units and buffers, in the DRAM, which avoids the full-core area and power overheads of previous work and thus makes PIM feasible for the first time, and (b) employs a DRAM-like interface for the host to issue commands to the PIM compute. The PIM compute is rate-matched to the internal DRAM bandwidth and employs a non-intuitive, global input vector buffer shared by the entire channel to capture input reuse while amortizing buffer area cost. To the host, Newton's interface is indistinguishable from regular DRAM, with no offloading overheads or PIM/non-PIM mode switching, and with the same deterministic latencies even for floating-point commands.
(2) To prevent the PIM-host interface from becoming a bottleneck, we include three optimizations: commands that gang multiple compute operations both within a bank and across banks; complex, multi-step compute commands (both of which save critical command bandwidth); and targeted reduction of tFAW overhead. (3) To capture output vector reuse with reasonable buffering, Newton employs an unusually wide interleaved layout for the matrix. Our simulations running state-of-the-art neural networks show that, building on a realistic HBM2E-like DRAM, Newton achieves 10x and 54x average speedup over a non-PIM system with infinite compute that perfectly uses the external DRAM bandwidth and over a realistic GPU, respectively.
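
As a rough illustration of the idea of a shared input buffer feeding per-bank multiply-accumulate units, the following toy Python model computes a matrix-vector product with matrix rows spread across banks. It is only a sketch under stated assumptions: the function name `pim_gemv`, the parameters `num_banks` and `chunk`, the row-to-bank mapping, and the loop order are all hypothetical and do not reproduce Newton's actual layout, command set, or timing.

```python
def pim_gemv(matrix, vector, num_banks=4, chunk=8):
    """Toy matrix-vector product (hypothetical model, not Newton's design).

    Row r is assumed to live in bank r % num_banks. One chunk of the input
    vector is placed in a single shared buffer and reused by every bank.
    Returns (output_vector, number_of_shared_buffer_loads).
    """
    rows, cols = len(matrix), len(vector)
    out = [0.0] * rows
    buffer_loads = 0
    # Process rows in groups of num_banks: one row per bank at a time.
    for base in range(0, rows, num_banks):
        for start in range(0, cols, chunk):
            # Assumed shared buffer: loaded once, read by all banks.
            shared_buf = vector[start:start + chunk]
            buffer_loads += 1
            for bank in range(num_banks):
                r = base + bank
                if r >= rows:
                    break
                # Per-bank multiply-accumulate over the shared chunk.
                acc = out[r]
                for j, x in enumerate(shared_buf):
                    acc += matrix[r][start + j] * x
                out[r] = acc
    return out, buffer_loads
```

In this model each input chunk is loaded once per group of rows and reused by all banks, which is the kind of input reuse a shared buffer is meant to capture; a private buffer per bank would need `num_banks` times as many loads for the same work.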
