Abstract

Machine Learning in the form of Artificial Neural Networks (ANNs) has gained traction over the last few years, especially in applications such as image recognition and speech recognition. These applications typically employ a subset of ANNs known as Convolutional Neural Networks (CNNs), which reuse parameters and thus reduce main memory bandwidth. However, other types of ANN, such as autoencoders and Long Short-Term Memory (LSTM) networks, do not provide reuse opportunities. It is generally accepted that dynamic random-access memory (DRAM) is required to store the parameters of usefully sized ANNs. To achieve a given performance, CNN-specific implementations use cache-like structures built from static random-access memory (SRAM), which minimizes accesses to the slower DRAM. Most research has focused on implementing CNNs, but because of their extensive use of SRAM, these implementations suffer both ANN size restrictions and performance degradation when used in applications that employ other types of ANN. This work considers embedded applications employing multiple disparate generic ANNs which, assuming limited opportunities for parameter reuse or batch processing, will require usable memory bandwidth on the order of tens of Tbit/s. This work supports Deep Neural Networks (DNNs) that do not provide ANN parameter reuse and suggests that such applications will require all ANN parameters in main memory to be accessed in real time. This work coins the phrase “goldilocks bandwidth” for ANN systems that provide the bandwidth required to read all ANN parameters at a real-time rate. This work employs pure 3DIC technology along with a proposed custom 3D-DRAM that exposes an entire page over a very wide data bus (Fig 3). The 3DIC system die stack (Fig 1) includes the 3D-DRAM, a system manager layer and a Processing Engine (PE) layer, collectively known as a Sub-System Column (SSC) (Fig 4).
The targeted 3D-DRAM, the Tezzaron DiRAM4 [1], employs multiple memory array layers in conjunction with a control and IO layer. It provides 64 separate vaults, each with 1 Gbit of storage, which, along with the suggested customizations, gives this work up to 65 Tbit/s of memory bandwidth.
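
As a rough illustration of the "goldilocks bandwidth" idea, the following Python sketch relates the figures quoted above (64 vaults, 1 Gbit per vault, up to 65 Tbit/s) to the rate at which all stored parameters could be read. It is a back-of-the-envelope sketch, not the authors' model: the function names are hypothetical, decimal units (1 Gbit = 10^9 bits) are assumed, and it assumes the whole memory holds parameters that are each read exactly once per pass.

```python
# Back-of-the-envelope sketch. Helper names are hypothetical; values marked
# "from the abstract" are quoted from the text, the rest are assumptions.

GBIT = 10**9   # assumption: decimal gigabit
TBIT = 10**12  # assumption: decimal terabit


def required_bandwidth(param_bits: int, passes_per_s: float) -> float:
    """Bits per second needed to read every parameter once per pass (no reuse)."""
    return param_bits * passes_per_s


def max_pass_rate(param_bits: int, bandwidth_bits_per_s: float) -> float:
    """Full parameter sweeps per second that a given bandwidth can sustain."""
    return bandwidth_bits_per_s / param_bits


VAULTS = 64                 # from the abstract
BITS_PER_VAULT = 1 * GBIT   # from the abstract
PEAK_BANDWIDTH = 65 * TBIT  # from the abstract (stated as an upper bound)

# Assumption: the entire memory is filled with ANN parameters.
capacity_bits = VAULTS * BITS_PER_VAULT
sweeps_per_s = max_pass_rate(capacity_bits, PEAK_BANDWIDTH)

if __name__ == "__main__":
    print(f"capacity: {capacity_bits / GBIT:.0f} Gbit")
    print(f"full-memory sweeps per second at peak: {sweeps_per_s:.1f}")
```

Under these assumptions, the peak figure corresponds to roughly a thousand full sweeps of the stored parameters per second. Sustained bandwidth in a real system would likely be lower than the quoted peak.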
