Abstract

Optimizations of primitive routines such as General Matrix Multiplication (GEMM) continue to advance state-of-the-art performance in the applications that depend on them. Libraries such as the Basic Linear Algebra Subprograms (BLAS) provide API interfaces to highly tuned, hardware-specific implementations. Applications such as deep learning push the limits of these subroutines by demanding optimizations such as reduced numeric precision and data-type bit-width, exploitation of error resiliency, and removal of redundancy. Hardware plays a considerable role in subroutine performance because of the fixed layout of memory and arithmetic structures, such as the SIMD units found in general-purpose CPU and GPU architectures. FPGAs occupy a unique position in this space because their reconfigurable logic and routing support pipelined designs with both SIMD- and MIMD-like execution. Within such a pipeline, an FPGA can accelerate operations through low-bit-width and fine-grained designs of a kind typically seen only in ASICs. In this paper, we present an OpenCL-based GEMM accelerator design that exploits sparsity and compression to persist data in the FPGA's fine-grained on-chip SRAM. The design supports successive GEMMs with added activation functions, making it suitable for some machine learning applications. Results are measured on Intel's Arria 10 GX 1150 FPGA. Compared with non-sparse, non-persistent designs, we achieve speedups approaching the theoretical limits of 6.5× for our fine-grained implementation and 8× for our structured implementation. We measure over 1 TOP/s while utilizing only 17% of the Arria 10's DSP blocks for a 90% sparse design.

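To make the target computation concrete, the sketch below shows, in plain C rather than the paper's OpenCL kernels, the kind of operation the accelerator targets: a GEMM whose output passes through an activation function so that a subsequent GEMM can consume it. The matrix layout, the ReLU choice, and the zero-skipping test are illustrative assumptions, not the paper's actual design.

    /* Minimal C sketch (not the paper's OpenCL kernel) of a GEMM with a
     * fused activation, where zero entries in A are skipped. Skipping that
     * redundant work is the opportunity a sparsity-aware design exploits. */
    #include <stddef.h>

    static float relu(float x) { return x > 0.0f ? x : 0.0f; }

    /* C = act(A * B); A is M x K, B is K x N, C is M x N, row-major. */
    void gemm_act(const float *A, const float *B, float *C,
                  size_t M, size_t K, size_t N)
    {
        for (size_t i = 0; i < M; ++i) {
            for (size_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; ++k) {
                    float a = A[i * K + k];
                    if (a != 0.0f)              /* skip zero (sparse) operands */
                        acc += a * B[k * N + j];
                }
                C[i * N + j] = relu(acc);       /* fused activation */
            }
        }
    }

The output C can then be fed directly back in as the A operand of the next call, mirroring the successive-GEMM-with-activation pattern described above.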