Abstract

Systolic arrays have been successful in accelerating dense linear algebra for deep neural networks (DNNs), but they cannot handle sparse computations efficiently. Although early attempts have been made to perform sparse matrix operations on weight-pruned DNNs, handling the highly sparse matrices with skewed nonzero distributions commonly seen in real-world graph analytics remains challenging. In this paper, we propose the FlexTPU framework to repurpose tensor processing units (TPUs) to execute sparse matrix-vector multiplication (SpMV). First, we propose a lightweight Z-shape mapping of sparse matrices onto the systolic array that eliminates the processing of zeros as much as possible, regardless of the sparsity and nonzero distribution. On top of the mapping, we devise an SpMV dataflow executed by an array of processing elements (PEs), each a slightly modified version of the conventional TPU PE. Second, in contrast to the heavy preprocessing required by prior approaches, the Z-shape mapping enables on-the-fly matrix condensing from widely used compressed sparse representations (e.g., compressed sparse row, CSR). This is accomplished by a proposed sparse data loader comprising an on-chip row decoder and parallel nonzero loaders. We evaluate FlexTPU on a broad set of synthetic and real-world sparse matrices. The experimental results show that FlexTPU achieves a 3.55× speedup and 3.27× energy saving over a state-of-the-art design, Sparse-TPU, and it performs even better on sparse matrices with power-law nonzero distributions. Compared to state-of-the-art library implementations on a CPU and a GPU, FlexTPU also achieves average speedups of 2.4× and 4.3×, and energy savings of 130.4× and 495.3×, respectively. FlexTPU is further evaluated against a recent reconfigurable chip multi-processor (CMP) machine, Transmuter, which it outperforms with a 5.12× speedup and 2.65× energy saving.
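For context, the CSR representation referenced above stores a sparse matrix as three flat arrays: row pointers, column indices, and nonzero values. The following is a minimal sequential SpMV sketch over CSR for reference only; the array names are illustrative, and this scalar loop is not the paper's systolic dataflow or sparse data loader.

    #include <stddef.h>

    /* Reference SpMV (y = A * x) over CSR, for illustration only.
     * CSR stores a sparse m-by-n matrix A as three arrays:
     *   row_ptr[m+1] : nonzeros of row i occupy positions
     *                  row_ptr[i] .. row_ptr[i+1]-1
     *   col_idx[nnz] : column index of each nonzero
     *   val[nnz]     : value of each nonzero
     * Names are hypothetical, not taken from the paper. */
    void spmv_csr(size_t m, const size_t *row_ptr, const size_t *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            /* Only stored nonzeros are visited, so zeros cost nothing. */
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += val[k] * x[col_idx[k]];
            y[i] = acc;
        }
    }

Whereas this loop walks nonzeros one row at a time on a single core, the abstract describes condensing the same CSR arrays on the fly into a Z-shape layout so that an array of PEs can consume the nonzeros in parallel.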
