Abstract

Many modern data sets are multidimensional and are naturally represented as tensors. Tensor computations have been applied in many fields, and various software libraries have been developed for them; however, little attention has been paid to hardware architectures that accelerate tensor computations. In this article, an efficient and unified processing element (PE) array for 3-D tensor computation is presented. The PE array is optimized for thin-and-tall tensor–matrix multiplication and for two types of tensor-times-matrix chain (TTMc) operations. The design is evaluated in three case studies and compared with the state-of-the-art design. By partitioning and rearranging the computation, data movement between the field-programmable gate array (FPGA) and off-chip DDR memory is reduced by $O(I^{2})$, where $I$ is the largest of the data tensor's dimensions. For the TTMc implementation, the clock frequency is 18% higher than that of the state-of-the-art implementation on the same FPGA chip. As a demonstration, an experiment on rendering a 3-D volumetric data set by a tensor-approximation method is conducted: for the brick reconstruction process, the runtime on our FPGA implementation is 50% lower, i.e., two times faster, than on a GPU. For CANDECOMP/PARAFAC decomposition, the per-iteration runtime is reduced by up to 93% compared with programs implemented with Tensorly, a Python library.
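For readers unfamiliar with the TTMc operation the abstract refers to, the following is a minimal NumPy sketch (not the paper's hardware algorithm): it contracts a 3-D tensor with a matrix along each mode in turn, the same computational pattern that the PE array accelerates. The function names `mode_n_product` and `ttmc` are illustrative, not from the paper.

```python
import numpy as np

def mode_n_product(T, M, n):
    # Multiply tensor T by matrix M along mode n: contract T's
    # n-th axis with M's columns, then move the new axis back to
    # position n so mode ordering is preserved.
    out = np.tensordot(T, M, axes=(n, 1))
    return np.moveaxis(out, -1, n)

def ttmc(T, matrices):
    # Tensor-times-matrices chain (TTMc): apply a mode-n product
    # for every mode of T in sequence.
    for n, M in enumerate(matrices):
        T = mode_n_product(T, M, n)
    return T

# A (4, 5, 6) data tensor contracted with three factor matrices
# of shapes (2, 4), (3, 5), (4, 6) yields a (2, 3, 4) tensor.
T = np.arange(4 * 5 * 6, dtype=float).reshape(4, 5, 6)
A = np.ones((2, 4))
B = np.ones((3, 5))
C = np.ones((4, 6))
G = ttmc(T, [A, B, C])
print(G.shape)  # (2, 3, 4)
```

With all-ones factor matrices, every element of the result equals the sum of all entries of `T`, which is a quick sanity check that each mode was contracted exactly once.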
