Abstract

Tensor algebra is used in many domains, including machine learning and data analytics. Spatial hardware accelerators are widely used to boost the performance of tensor algebra applications, but they have complex hardware architectures and a rich design space. Prior approaches based on manual implementation suffer from low programming productivity, which makes it hard to explore this large design space. In this paper, we propose Tensorlib, a framework for generating spatial hardware accelerators for tensor algebra applications. Tensorlib is motivated by the observation that tensor dataflows can be expressed as linear transformations and that they share common hardware modules, which can be reused across different designs. Tensorlib first uses Space-Time Transformation, which compactly represents a hardware dataflow as a transformation matrix, to explore different dataflows. Next, we identify the common structures of different dataflows and build parameterized hardware module templates. Our generation framework selects the hardware modules needed for each dataflow, connects them using a specified interconnection pattern, and automatically generates the complete hardware accelerator design. Tensorlib substantially improves productivity in developing and optimizing spatial hardware architectures, and provides a rich design space with trade-offs in performance, area, and power. Experiments show that Tensorlib can automatically generate hardware designs with different dataflows for a variety of tensor algebra programs. For a matrix multiplication kernel on a Xilinx VU9P FPGA, Tensorlib achieves a frequency of 318 MHz and a throughput of 786 GFLOP/s, outperforming state-of-the-art generators.
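
To make the idea of a space-time transformation concrete, the following is a minimal Python sketch, not Tensorlib's actual implementation or API. It assumes a matrix multiplication loop nest C[i][j] += A[i][k] * B[k][j], a 3x3 matrix T that maps each iteration (i, j, k) to a processing-element position and a time step (x, y, t), and a simplified three-way classification of how each tensor moves through the array. The function names, the two example matrices, and the category labels are illustrative assumptions.

```python
# Illustrative sketch only: names, matrices, and labels are assumptions,
# not Tensorlib's API.

def matvec(T, v):
    """Multiply matrix T (list of rows) by vector v."""
    return tuple(sum(row[c] * v[c] for c in range(len(v))) for row in T)

# Reuse direction of each tensor in C[i][j] += A[i][k] * B[k][j]:
# the loop direction along which the same tensor element is touched again.
REUSE = {
    "A": (0, 1, 0),  # A[i][k] does not depend on j
    "B": (1, 0, 0),  # B[k][j] does not depend on i
    "C": (0, 0, 1),  # C[i][j] does not depend on k
}

def classify(T, reuse_dir):
    """Classify data movement from the transformed reuse direction (dx, dy, dt)."""
    *space, dt = matvec(T, reuse_dir)
    moves_in_space = any(s != 0 for s in space)
    if moves_in_space and dt != 0:
        return "systolic"    # passed to a neighboring PE one step later
    if moves_in_space:
        return "multicast"   # visible at several PEs in the same time step
    if dt != 0:
        return "stationary"  # stays in one PE across time steps
    raise ValueError("T must be invertible and reuse_dir non-zero")

def dataflow(T):
    """Return the movement class of every tensor under transformation T."""
    return {name: classify(T, d) for name, d in REUSE.items()}

# x = i, y = j, t = i + j + k
OUTPUT_STATIONARY = [[1, 0, 0],
                     [0, 1, 0],
                     [1, 1, 1]]

# x = i, y = k, t = j (a second, hypothetical choice of T)
A_STATIONARY = [[1, 0, 0],
                [0, 0, 1],
                [0, 1, 0]]

if __name__ == "__main__":
    print(dataflow(OUTPUT_STATIONARY))
    print(dataflow(A_STATIONARY))
```

Under these assumptions, changing only the entries of T switches the design between an output-stationary systolic array and an array that keeps A in place, which is the sense in which a single matrix compactly identifies a dataflow.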
