Abstract

Machine learning (ML) accelerators have a broad spectrum of use cases that pose different requirements on accelerator design for latency, energy, and area. In the case of systolic array-based ML accelerators, this puts different constraints on processing element (PE) array dimensions and SRAM buffer sizes. 3-D integration packs more compute or memory into the same 2-D footprint, which can be exploited to build more powerful or more energy-efficient accelerators. However, 3-D also expands the design space of ML accelerators by additionally including the different possible ways of partitioning the PE array and SRAM buffers among the vertical tiers. Moreover, the partitioning approach may also have different thermal implications. This work provides a systematic framework for performing system-level design space exploration of 3-D systolic accelerators. Using this framework, different 3-D-partitioned accelerator configurations are proposed and evaluated. The 3-D-stacked accelerator designs are modeled using the hybrid wafer bonding technique with a 1.44-µm pitch of 3-D connection. Results show that different partitionings of the systolic array and SRAM buffers in a four-tier 3-D configuration can lead to either a 1.1–3.9× latency reduction or a 1–3× energy reduction compared to a baseline design with the same 2-D area footprint. It is also shown that by carefully organizing the systolic array and SRAM tiers with logic over memory, the temperature rise with 3-D across benchmarks can be limited to 6 °C.
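As a rough illustration of how 3-D stacking enlarges this design space, the following Python sketch enumerates the ways a four-tier stack could be split between PE-array tiers and SRAM-buffer tiers. The tier count, array dimension, and SRAM budget are illustrative assumptions, not values taken from this work.

```python
# Hypothetical sketch of the 3-D partitioning design space: split a
# four-tier stack between PE-array tiers and SRAM tiers and report the
# per-tier resources. The tier count, array size, and SRAM budget are
# illustrative assumptions, not values from this work.

TIERS = 4
TOTAL_PES = 128 * 128        # assumed baseline 2-D systolic array (128x128)
TOTAL_SRAM_KB = 2048         # assumed total on-chip SRAM budget (KiB)

def tier_partitions(tiers=TIERS):
    """Yield (pe_tiers, sram_tiers) splits of the stack."""
    for pe_tiers in range(1, tiers):
        yield pe_tiers, tiers - pe_tiers

for pe_tiers, sram_tiers in tier_partitions():
    # Folding the PE array over more tiers shrinks its 2-D footprint,
    # leaving room for wider arrays or larger buffers in the same area.
    print(f"{pe_tiers} PE tier(s) + {sram_tiers} SRAM tier(s): "
          f"{TOTAL_PES // pe_tiers} PEs/tier, "
          f"{TOTAL_SRAM_KB // sram_tiers} KiB SRAM/tier")
```

In the paper's framework, each such split is further evaluated for latency, energy, and thermal behavior; the sketch above only enumerates the partitioning choices.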

Highlights

  • Machine learning (ML) algorithms are composed of both computationally and memory-intensive matrix multiplication operations

  • Intuitively, stacking multiple SRAM tiers reduces data transfers to main memory (DRAM), bringing down the total energy, especially for memory-limited networks (see the energy sketch after this list)

  • Configuration 7, operating on partitioned output channels, requires input feature maps to be duplicated in the SRAMs, causing multiple DRAM accesses that fetch the same input data and leading to high total energy
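
The two energy effects highlighted above can be made concrete with a deliberately coarse Python model. The per-byte energies, layer sizes, SRAM capacities, and the reuse heuristic are all assumptions for illustration, not numbers or models from the paper.

```python
# Coarse illustration of the two energy highlights above: extra stacked
# SRAM cuts DRAM refetches, while duplicating input feature maps across
# partitioned buffers (a configuration-7-like mapping) inflates DRAM
# traffic. All constants and the reuse heuristic are illustrative
# assumptions, not values from the paper.

E_DRAM_PJ = 100.0   # assumed DRAM access energy, pJ per byte
E_SRAM_PJ = 5.0     # assumed on-chip SRAM access energy, pJ per byte

def dram_bytes(weights, ifmaps, sram, ifmap_copies=1):
    """If the working set fits in SRAM, every byte comes from DRAM once;
    otherwise it is refetched once per extra SRAM-sized pass."""
    traffic = weights + ifmaps * ifmap_copies
    if traffic <= sram:
        return traffic
    passes = -(-traffic // sram)          # ceiling division
    return traffic * passes

def energy_uj(weights, ifmaps, sram, ifmap_copies=1):
    d = dram_bytes(weights, ifmaps, sram, ifmap_copies)
    on_chip = weights + ifmaps * ifmap_copies
    return (d * E_DRAM_PJ + on_chip * E_SRAM_PJ) / 1e6

layer = dict(weights=4 << 20, ifmaps=8 << 20)    # 4 MiB weights, 8 MiB ifmaps
print("baseline SRAM (2 MiB):   ", energy_uj(**layer, sram=2 << 20), "uJ")
print("4x stacked SRAM (8 MiB): ", energy_uj(**layer, sram=8 << 20), "uJ")
print("duplicated ifmaps, 8 MiB:",
      energy_uj(**layer, sram=8 << 20, ifmap_copies=2), "uJ")
```

Under these assumed constants, quadrupling the on-chip SRAM cuts the modeled energy sharply, while duplicating the input feature maps gives back much of that gain, mirroring the two highlights.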


Introduction

Machine learning (ML) algorithms are composed of both computationally and memory-intensive matrix multiplication operations. Systolic array architectures [1] achieve high throughput with modest bandwidth for matrix multiplication and therefore make a good choice for ML acceleration. The 2-D-enhanced architectures [9] place dies side by side and interconnect them through media, such as a silicon interposer [10] or an embedded bridge [11], [12], to achieve higher interconnect densities compared to mainstream packages. As 3-D technologies evolve, increasingly finer pitches of 3-D connections become viable [18], [19]. This opens interesting possibilities for designers to partition and fold designs onto multiple tiers [20], [21]. The interposer may contain through-silicon vias (TSVs) [24], which are essentially holes etched in the silicon wafer and filled with a conductive metal such as copper.
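To make the systolic dataflow concrete, here is a minimal cycle-level Python sketch of an output-stationary array computing C = A·B. It is not the architecture or simulator used in this work; the array size, dataflow, and operand skewing are illustrative assumptions. Each operand enters the array once at the edge and is reused as it propagates through the PE grid, which is why systolic arrays reach high throughput with modest bandwidth.

```python
# Minimal cycle-level sketch of an output-stationary systolic array
# computing C = A @ B. Illustrative only; not the paper's design.
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary dataflow: PE (i, j) holds c[i, j]; a-values flow
    right along rows and b-values flow down along columns, skewed so that
    a[i, k] and b[k, j] meet at PE (i, j) at cycle t = i + j + k."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    cycles = (M - 1) + (N - 1) + K          # fill + drain latency
    for t in range(cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j
                if 0 <= k < K:              # operands meet at this PE now
                    C[i, j] += A[i, k] * B[k, j]
    return C, cycles

A = np.random.randint(0, 4, (8, 8))
B = np.random.randint(0, 4, (8, 8))
C, cycles = systolic_matmul(A, B)
assert np.array_equal(C, A @ B)
print(f"8x8x8 matmul finished in {cycles} systolic cycles")
```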
