Abstract

A sparse matrix–matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits a 12.6× (8.4×) energy efficiency gain, an 11.7× (77.6×) off-chip bandwidth efficiency gain, and a 17.1× (36.9×) compute density gain over a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.

Highlights

  • The emergence of big data and massive social networks has led to the increased importance of graph analytics and machine learning workloads

  • The performance of our 2.0-mm × 2.6-mm accelerator for sparse matrix–matrix multiplication (SpMM), with the chip layout shown in Fig. 9, was evaluated through matrix squaring on synthetic matrices, as well as on power-law graphs that are representative of real-world sparse matrices [24], [25]

  • The 48 heterogeneous cores, comprising 32 custom processing elements (PEs) and 8 Arm Cortex-M0 + Cortex-M4 pairs, are tightly coupled via a coalescing crossbar and reconfigurable memory


Summary

INTRODUCTION

The emergence of big data and massive social networks has led to the increased importance of graph analytics and machine learning workloads. The compressed sparse row (CSR) format is a standard for storing sparse matrices in graph analytics, scientific computation, and so on [11]. It represents an N × N sparse matrix using three arrays (values, column indices, and row pointers) with a total storage overhead of 2 · NNZ + N + 1 elements, where NNZ is the number of nonzero elements (NZEs).

In the inner-product method, a row of the first operand is multiplied by a column of the second operand to produce a single element of the result matrix. While this approach works efficiently for dense matrices, once the matrices become too sparse, a significant portion of the runtime is spent on index matching the two operands to find the NZEs with the same row or column indices. The outer-product approach instead multiplies the columns of the first operand with the rows of the second operand to generate partial-product matrices that are summed together to produce the final result.
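
A minimal sketch in plain Python may help make the two ideas concrete. The CSR arrays below encode a small 3 × 3 example, and outer_product_spmm (a name chosen here for illustration, not taken from the paper) implements the outer-product formulation. For clarity the sketch accumulates partial products into a dictionary; the accelerator instead merges sorted partial-product lists in hardware.

    # CSR encoding of the 3x3 matrix A = [[5, 0, 0],
    #                                     [0, 0, 7],
    #                                     [0, 3, 0]].
    # Storage is 2*NNZ + N + 1 = 2*3 + 3 + 1 = 10 elements in total.
    a_csr = (
        [5.0, 7.0, 3.0],  # values: the NNZ nonzero elements, row by row
        [0, 2, 1],        # column indices: one per nonzero element
        [0, 1, 2, 3],     # row pointers: N + 1 offsets into the arrays above
    )

    # The outer product reads the first operand by columns, so A is also
    # given here in the column-oriented CSC form (values, row indices,
    # column pointers).
    a_csc = (
        [5.0, 3.0, 7.0],  # values: the nonzero elements, column by column
        [0, 2, 1],        # row indices: one per nonzero element
        [0, 1, 2, 3],     # column pointers: N + 1 offsets
    )

    def outer_product_spmm(a_csc, b_csr, n):
        """Compute C = A * B by the outer-product method.

        Column k of A is multiplied with row k of B to form one
        partial-product matrix; summing the n partial products yields C.
        A dict keyed by (i, j) stands in for the on-chip merge of sorted
        partial products.
        """
        a_vals, a_rows, a_colptr = a_csc
        b_vals, b_cols, b_rowptr = b_csr
        c = {}
        for k in range(n):
            for p in range(a_colptr[k], a_colptr[k + 1]):      # nonzeros in column k of A
                i, a_ik = a_rows[p], a_vals[p]
                for q in range(b_rowptr[k], b_rowptr[k + 1]):  # nonzeros in row k of B
                    j, b_kj = b_cols[q], b_vals[q]
                    c[(i, j)] = c.get((i, j), 0.0) + a_ik * b_kj
        return c

    # Matrix squaring, as used in the evaluation: C = A * A.
    print(outer_product_spmm(a_csc, a_csr, 3))
    # {(0, 0): 25.0, (2, 2): 21.0, (1, 1): 21.0}

Note that no index matching between the operands is needed: every nonzero pair (A[i,k], B[k,j]) contributes a useful multiplication, which is why the outer-product method suits highly sparse inputs.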

ALGORITHM AND ARCHITECTURE
Architectural Overview
Outer-Product Algorithm
Scratchpad Prefetching
Sorting Algorithm
CIRCUIT IMPLEMENTATION
Coalescing Crossbar
Reconfigurable Cache
MEASURED RESULTS
Benefits of Reconfigurable Memory
Comparison With the State-of-the-Art Approaches
Sorting Algorithm Comparison
CONCLUSION