The rapid evolution of Cloud-based services and the growing interest in deep learning (DL)-based applications are putting increasing pressure on hyperscalers and general-purpose hardware designers to provide more efficient and scalable systems. Cloud-based infrastructures must consist of more energy-efficient components, and this evolution must span from the core of the infrastructure (i.e., data centers (DCs)) to its edges (Edge computing) to adequately support new and future applications. Adaptability/elasticity is one of the features required to increase the performance-to-power ratio. Hardware-based mechanisms have been proposed to support system reconfiguration mostly at the processing-element level, while fewer studies have addressed scalable, modular interconnection sub-systems. In this paper, we propose a scalable Software Defined Network-on-Chip (SDNoC)-based architecture. By leveraging a modular design approach, our solution can easily be adapted to support devices ranging from low-power computing nodes at the edge of the Cloud to high-performance many-core processors in Cloud DCs. The proposed design merges the benefits of hierarchical network-on-chip (NoC) topologies (by fusing the ring and the 2D-mesh topology) with those brought by dynamic reconfiguration (i.e., adaptation). The proposed interconnect allows different types of virtualised topologies to be created to serve different communication requirements, thus providing better resource partitioning (virtual tiles) for concurrent tasks. To allow the software layer to control and monitor the NoC subsystem, a few customised instructions supporting a data-driven program execution model (PXM) are added to the processing element's instruction set architecture (ISA); data-driven programming and execution models are well suited to DL applications. To ease programming of the proposed system, we also introduce a mechanism for mapping a high-level programming language that embeds concurrent execution models onto the basic functionalities offered by our SDNoC. In the reported experiments, we compared our lightweight reconfigurable architecture to a conventional flattened 2D-mesh interconnection subsystem. Results show that our design provides an increase in data traffic throughput of % and a reduction of in average packet latency, compared to a flattened 2D-mesh topology connecting the same number of processing elements (PEs) (up to 1024 cores). Power and resource consumption (on FPGA devices) are also low, confirming the good scalability of the proposed architecture.
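For illustration only, the sketch below shows how NoC-control instructions of the kind mentioned in the abstract could be surfaced to software as compiler intrinsics. Every identifier (sdnoc_cfg_vtile, sdnoc_set_topology, sdnoc_send, sdnoc_stat), as well as the PE bitmap and topology codes, is an assumption of ours and not the paper's actual ISA extension; it is a minimal sketch of the programming style, not the authors' implementation.

```c
/*
 * Minimal sketch (hypothetical names, not the paper's actual ISA):
 * software drives the SDNoC through a handful of custom instructions
 * exposed as compiler intrinsics.
 */
#include <stdint.h>

/* Hypothetical intrinsics wrapping the custom NoC-control instructions. */
extern void     sdnoc_cfg_vtile(uint32_t vtile_id, uint64_t pe_bitmap);   /* group PEs into a virtual tile     */
extern void     sdnoc_set_topology(uint32_t vtile_id, uint32_t topo);     /* select the tile's virtual topology */
extern void     sdnoc_send(uint32_t dst_pe, const void *buf, uint32_t n); /* data-driven (PXM-style) send       */
extern uint32_t sdnoc_stat(uint32_t vtile_id, uint32_t counter);          /* read a NoC monitoring counter      */

enum { TOPO_RING = 0, TOPO_MESH = 1 };   /* topology codes assumed for this sketch */

/* Carve out a 4-PE virtual tile, shape it as a ring, and push one dataflow token. */
void example_task(const void *payload, uint32_t len)
{
    const uint32_t vt = 1;                 /* virtual-tile identifier (assumed)            */
    sdnoc_cfg_vtile(vt, 0x000000F0ULL);    /* bitmap selects PEs 4..7 for the tile         */
    sdnoc_set_topology(vt, TOPO_RING);     /* ring suits neighbour-to-neighbour traffic    */
    sdnoc_send(5, payload, len);           /* consumer PE fires when the token arrives     */
    (void)sdnoc_stat(vt, 0);               /* e.g., poll an injected-flit counter          */
}
```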