Off-chip Memory Access Research Articles

This article (Colonnade) presents a fully digital bit-serial compute-in-memory (CIM) macro. The digital CIM macro is designed for processing neural networks with reconfigurable 1–16 bit input and weight precisions, based on a bit-serial computing architecture and a novel all-digital bitcell structure. A column of bitcells forms a column MAC and is used to compute a multiply-and-accumulate (MAC) operation. The column MACs placed in a row work as a single neuron and compute a dot product, which is an essential building block of neural network accelerators. Several key features differentiate the proposed Colonnade architecture from existing analog and digital implementations. First, its fully digital circuit implementation is free from the process variation, noise susceptibility, and data-conversion overhead that are prevalent in prior analog CIM macros. A bitwise MAC operation in a bitcell is performed in the digital domain using a custom-designed XNOR gate and a full adder. Second, the proposed CIM macro is fully reconfigurable in both weight and input precision from 1 to 16 bits. So far, most analog macros have been used for processing quantized neural networks with very low input/weight precisions, mainly due to a memory density issue. Recent digital accelerators have implemented reconfigurable precisions, but they are inferior in energy efficiency due to significant off-chip memory access. We present a regular digital bitcell array that is readily reconfigured into a 1–16 bit weight-stationary bit-serial CIM macro. The macro computes parallel dot-product operations between the weights stored in memory and inputs that are serialized from LSB to MSB. Finally, the bit-serial computing scheme significantly reduces the area overhead, at the cost of latency due to bit-by-bit operation cycles.
Based on the benefits of digital CIM, reconfigurability, and the bit-serial computing architecture, Colonnade can achieve both high performance and energy efficiency (i.e., the benefits of both prior analog and digital accelerators) for processing neural networks. A test chip with $128 \times 128$ SRAM-based bitcells for digital bit-serial computing is implemented in 65-nm technology and tested with 1–16 bit weight/input precisions. The measured energy efficiency is 117.3 TOPS/W at 1 bit and 2.06 TOPS/W at 16 bit.
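
The bit-serial, weight-stationary dot product described above can be illustrated with a small behavioral sketch. It is not the Colonnade circuit: the function names `bit_serial_dot` and `xnor_dot` are hypothetical, inputs are assumed to be unsigned integers, and the 1-bit case assumes the common convention that bit 1 encodes +1 and bit 0 encodes -1, which the abstract does not state.

```python
def bit_serial_dot(weights, inputs, input_bits):
    """Behavioral model of a weight-stationary, bit-serial dot product.

    Assumption: inputs are unsigned integers that fit in `input_bits` bits.
    Weights stay fixed (stationary) while input bits arrive LSB first.
    """
    acc = 0
    for b in range(input_bits):  # one cycle per input bit, LSB to MSB
        # Each weight contributes only when its input's current bit is 1.
        partial = sum(w for w, x in zip(weights, inputs) if (x >> b) & 1)
        acc += partial << b      # scale the partial sum by 2**b
    return acc


def xnor_dot(w_bits, x_bits):
    """1-bit dot product under the assumed +1/-1 encoding (1 -> +1, 0 -> -1).

    XNOR of two bits is 1 exactly when they agree, i.e. when the product
    of the encoded values is +1.
    """
    matches = sum(1 for w, x in zip(w_bits, x_bits) if w == x)
    return 2 * matches - len(w_bits)
```

For instance, `bit_serial_dot([3, -2, 5], [5, 7, 1], 4)` returns 6, the same value as 3·5 - 2·7 + 5·1, but takes four passes over the weights, one per input bit. This reflects the trade-off stated in the abstract: less hardware per column in exchange for more cycles per operation.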

Graph neural networks (GNNs), which extend traditional neural networks to graph-structured data, have been widely used in many fields. GNN computation mainly consists of edge processing, which generates messages by combining edge and vertex features, and vertex processing, which updates the vertex features with the aggregated messages. In addition to nontrivial vector operations in edge processing and large numbers of random accesses and neural network operations in vertex processing, the graph topology may also vary during the computation (i.e., dynamic GNNs). These characteristics pose significant challenges to existing architectures. In this article, we propose a novel accelerator named CAMBRICON-G for efficient processing of both dynamic and static GNNs. The key idea of CAMBRICON-G is to abstract the irregular computation of a broad range of GNN variants as the processing of a regularly tiled adjacent cuboid (which extends the traditional adjacency matrix of a graph by adding the dimension of vertex features). The intuition is that the adjacent cuboid facilitates exploitation of both data locality and parallelism by offering multidimensional multilevel tiling (including spatial and temporal tiling) opportunities. To perform the multidimensional spatial tiling, the CAMBRICON-G architecture mainly consists of the cuboid engine (CE) and hybrid on-chip memory.
The CE has multiple vertex processing units (VPUs) that work in a coordinated manner to efficiently process the sparse data and dynamically update the graph topology with dedicated instructions. The hybrid on-chip memory contains a topology-aware cache and multiple scratchpad memories to reduce off-chip memory access. To perform the multidimensional temporal tiling, an easy-to-use programming model is provided to flexibly explore different tiling options for large graphs. Experimental results show that, compared with an Nvidia P100 GPU, performance and energy efficiency improve by $7.14\times$ and $20.18\times$, respectively, on various GNNs, which validates both the versatility and energy efficiency of CAMBRICON-G.
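
As a rough illustration of the edge/vertex split and of tiling along the three axes of an adjacent cuboid (source vertex, destination vertex, feature), the sketch below compares an untiled message-passing step with a tiled one. It is not the CAMBRICON-G design or its programming model: the function names, the tile parameters `v_tile` and `f_tile`, and the choice of sum aggregation with an additive update are all assumptions made for the example.

```python
def gnn_layer(edges, feats):
    """One message-passing step over an edge list, without tiling.

    edges: list of (src, dst) pairs; feats: one feature vector per vertex.
    """
    dim = len(feats[0])
    agg = [[0.0] * dim for _ in feats]
    for src, dst in edges:
        # Edge processing: the message is simply the source feature here.
        for k in range(dim):
            agg[dst][k] += feats[src][k]
    # Vertex processing: add the aggregated messages to the old feature.
    return [[f + a for f, a in zip(fv, av)] for fv, av in zip(feats, agg)]


def gnn_layer_tiled(edges, feats, v_tile, f_tile):
    """Same computation, visited tile by tile over (dst, src, feature)."""
    n, dim = len(feats), len(feats[0])
    agg = [[0.0] * dim for _ in range(n)]
    for d0 in range(0, n, v_tile):            # destination-vertex tiles
        for s0 in range(0, n, v_tile):        # source-vertex tiles
            tile_edges = [(s, d) for s, d in edges
                          if s0 <= s < s0 + v_tile and d0 <= d < d0 + v_tile]
            for f0 in range(0, dim, f_tile):  # feature-dimension tiles
                for s, d in tile_edges:
                    for k in range(f0, min(f0 + f_tile, dim)):
                        agg[d][k] += feats[s][k]
    return [[f + a for f, a in zip(fv, av)] for fv, av in zip(feats, agg)]
```

Each tile touches only a bounded block of source features and destination accumulators, which is the kind of locality that on-chip buffers can exploit. With floating-point features the tiled and untiled results may differ in the last bits because the summation order changes; with integer-valued features they agree exactly.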

Articles published on Off-chip Memory Access

Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks

Cambricon-G: A Polyvalent Energy-Efficient Accelerator for Dynamic Graph Neural Networks

Ferroelectric Field-Effect Transistor-Based 3-D NAND Architecture for Energy-Efficient on-Chip Training Accelerator

Unary Coding and Variation-Aware Optimal Mapping Scheme for Reliable ReRAM-Based Neuromorphic Computing

DESCNet: Developing Efficient Scratchpad Memories for Capsule Network Hardware

Energy-Efficient Accelerator Design With Tile-Based Row-Independent Compressed Memory for Sparse Compressed Convolutional Neural Networks

DAM: Deadblock Aware Migration Techniques for STT-RAM-Based Hybrid Caches

Olympus: Reaching Memory-Optimality on DNN Processors

RRAM-DNN: An RRAM and Model-Compression Empowered All-Weights-On-Chip DNN Accelerator

SuperSlash: A Unified Design Space Exploration and Model Compression Methodology for Design of Deep Learning Accelerators With Reduced Off-Chip Memory Access Volume

A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs

A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs

A 460 GOPS/W Improved Mnemonic Descent Method-Based Hardwired Accelerator for Face Alignment

Fuzzy-Based Thermal Management Scheme for 3D Chip Multicores with Stacked Caches

McDRAM v2: In-Dynamic Random Access Memory Systolic Array Accelerator to Address the Large Model Problem in Deep Neural Networks on the Edge

Analysis of hardware implementations of deblocking filter for video codecs

Dual-load Bloom filter: Application for name lookup

High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic

MorphIC: A 65-nm 738k-Synapse/mm$^2$ Quad-Core Binary-Weight Digital Neuromorphic Processor With Stochastic Spike-Driven Online Learning

Adaptive Quantization as a Device-Algorithm Co-Design Approach to Improve the Performance of In-Memory Unsupervised Learning With SNNs
