Abstract

We present the Maestro memory-on-logic 3D-IC architecture for coordinated parallel use of multiple systolic arrays (SAs) in deep neural network (DNN) inference. Maestro reduces the under-utilization common in a single large SA by allowing many smaller SAs to work in parallel on DNN weight matrices of varying shapes and sizes. To buffer intermediate results in memory blocks (MBs) and to provide coordinated high-bandwidth communication between SAs and MBs when transferring weights and results, Maestro employs three innovations: (1) an SA on the logic die can access its corresponding MB on the memory die over a short distance using 3D-IC interconnects; (2) through an efficient switch based on H-trees, an SA can access any MB with low latency; and (3) the switch can combine partial results from SAs in an elementwise fashion before writing them back to a destination MB. We describe the Maestro architecture, including a circuit and layout design; detail the scheduling of the switch; analyze system performance for real-time inference applications using input with batch size equal to one; and showcase applications for deep learning inference, with ShiftNet for computer vision and recent Transformer models for natural language processing. For the same total number of systolic cells, Maestro, with multiple smaller SAs, achieves 16x and 12x latency improvements over a single large SA on ShiftNet and Transformer, respectively. Compared to a floating-point GPU implementation of ShiftNet and Transformer, a baseline Maestro system with 4,096 SAs (each with 8x8 systolic cells) provides latency improvements of 30x and 47x, respectively.
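The tiling idea behind the abstract can be illustrated with a minimal Python sketch. It is not the Maestro design or its scheduler: the function names `tile_matvec` and `utilization`, the zero-padding model of utilization, and the sequential loop standing in for parallel SAs are all assumptions made for illustration. It splits a weight matrix into 8x8 blocks (the SA size quoted for the baseline system), computes one partial result per block for a single input vector (batch size one), and adds partial results elementwise into the destination, which is the role the abstract assigns to the combining switch.

```python
import math

TILE = 8  # edge length of one small systolic array (8x8 cells)


def tile_matvec(weights, x, tile=TILE):
    """Compute weights @ x block by block, as if each block ran on its own SA.

    Hypothetical illustration: the blocks are visited sequentially here,
    whereas separate SAs would process them in parallel.
    """
    rows, cols = len(weights), len(weights[0])
    out = [0.0] * rows
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Partial result produced by the block at (r0, c0).
            partial = [
                sum(weights[r][c] * x[c] for c in range(c0, min(c0 + tile, cols)))
                for r in range(r0, min(r0 + tile, rows))
            ]
            # Elementwise combination of partial results into the destination.
            for i, p in enumerate(partial):
                out[r0 + i] += p
    return out


def utilization(rows, cols, sa_size):
    """Fraction of cells holding real weights when the matrix is zero-padded
    up to whole sa_size x sa_size blocks (an assumed, simplified model)."""
    padded_rows = math.ceil(rows / sa_size) * sa_size
    padded_cols = math.ceil(cols / sa_size) * sa_size
    return rows * cols / (padded_rows * padded_cols)
```

Under this simplified padding model, a 24x40 weight matrix fills 8x8 blocks completely, while a single 64x64 array would hold real weights in only 960 of its 4,096 cells. That gap is the kind of under-utilization the abstract refers to.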
