Abstract

A sparse matrix–matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits a 12.6× (8.4×) energy efficiency gain, an 11.7× (77.6×) off-chip bandwidth efficiency gain, and a 17.1× (36.9×) compute density gain over a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.

Highlights

  • The emergence of big data and massive social networks has led to the increased importance of graph analytics and machine learning workloads

  • The performance of our 2.0-mm × 2.6-mm accelerator for sparse matrix–matrix multiplication (SpMM), with the chip layout shown in Fig. 9, was evaluated through matrix squaring on synthetic matrices, as well as on power-law graphs that are representative of real-world sparse matrices [24], [25]

  • The 48 heterogeneous cores, comprising 32 custom processing elements (PEs) and 8 Arm Cortex-M0 + Cortex-M4 pairs, are tightly coupled via a coalescing crossbar and reconfigurable memory


Summary

INTRODUCTION

The emergence of big data and massive social networks has led to the increased importance of graph analytics and machine learning workloads. The compressed sparse row (CSR) format is a standard for storing sparse matrices in graph analytics, scientific computation, and so on [11]. It represents an N × N sparse matrix using three arrays (values, column indices, and row pointers) with a total storage overhead of 2 · NNZ + N + 1 elements, where NNZ is the number of nonzero elements (NZEs).

In the inner-product method, a row of the first operand is multiplied by a column of the second operand to produce a single element of the result matrix. While this approach works efficiently for dense matrices, once the matrices become too sparse, a significant portion of the runtime is spent on index matching the two operands to find the NZEs with the same row or column indices. The outer-product approach instead multiplies the columns of the first operand with the rows of the second operand to generate partial-product matrices that are summed together to produce the final result.
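
A minimal sketch in plain Python may help make the two ideas concrete. The CSR arrays below encode a small 3 × 3 example, and outer_product_spmm (a name chosen here for illustration, not taken from the paper) implements the outer-product formulation. For clarity the sketch accumulates partial products into a dictionary; the accelerator instead merges sorted partial-product lists in hardware.

    # CSR encoding of the 3x3 matrix A = [[5, 0, 0],
    #                                     [0, 0, 7],
    #                                     [0, 3, 0]].
    # Storage is 2*NNZ + N + 1 = 2*3 + 3 + 1 = 10 elements in total.
    a_csr = (
        [5.0, 7.0, 3.0],  # values: the NNZ nonzero elements, row by row
        [0, 2, 1],        # column indices: one per nonzero element
        [0, 1, 2, 3],     # row pointers: N + 1 offsets into the arrays above
    )

    # The outer product reads the first operand by columns, so A is also
    # given here in the column-oriented CSC form (values, row indices,
    # column pointers).
    a_csc = (
        [5.0, 3.0, 7.0],  # values: the nonzero elements, column by column
        [0, 2, 1],        # row indices: one per nonzero element
        [0, 1, 2, 3],     # column pointers: N + 1 offsets
    )

    def outer_product_spmm(a_csc, b_csr, n):
        """Compute C = A * B by the outer-product method.

        Column k of A is multiplied with row k of B to form one
        partial-product matrix; summing the n partial products yields C.
        A dict keyed by (i, j) stands in for the on-chip merge of sorted
        partial products.
        """
        a_vals, a_rows, a_colptr = a_csc
        b_vals, b_cols, b_rowptr = b_csr
        c = {}
        for k in range(n):
            for p in range(a_colptr[k], a_colptr[k + 1]):      # nonzeros in column k of A
                i, a_ik = a_rows[p], a_vals[p]
                for q in range(b_rowptr[k], b_rowptr[k + 1]):  # nonzeros in row k of B
                    j, b_kj = b_cols[q], b_vals[q]
                    c[(i, j)] = c.get((i, j), 0.0) + a_ik * b_kj
        return c

    # Matrix squaring, as used in the evaluation: C = A * A.
    print(outer_product_spmm(a_csc, a_csr, 3))
    # {(0, 0): 25.0, (2, 2): 21.0, (1, 1): 21.0}

Note that no index matching between the operands is needed: every nonzero pair (A[i,k], B[k,j]) contributes a useful multiplication, which is why the outer-product method suits highly sparse inputs.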

ALGORITHM AND ARCHITECTURE
Architectural Overview
Outer-Product Algorithm
Scratchpad Prefetching
Sorting Algorithm
CIRCUIT IMPLEMENTATION
Coalescing Crossbar
Reconfigurable Cache
MEASURED RESULTS
Benefits of Reconfigurable Memory
Comparison With the State-of-the-Art Approaches
Sorting Algorithm Comparison
CONCLUSION