Abstract

We propose a novel architecture that efficiently performs sparse tensor decomposition and completion. As generalizations of vectors and matrices, tensors are widely used to process high-dimensional data. Sparse tensor decomposition (SpTD) is not only an emerging tensor analysis technique but also an effective tool for reducing the storage and computation costs of tensors. However, conventional general-purpose processors perform SpTD inefficiently, mainly because of: i) variable degrees of sparsity and flexible buffer-size requirements; ii) the difficulty of fusing multiple execution kernels to improve performance. Designers of domain-specific accelerators, on the other hand, must also contend with the diversity of decomposition algorithms. To address these challenges, we propose a unified abstraction for SpTD algorithms and design a specialized accelerator. First, we formulate two types of core kernels (SpLrMM and LrSampling) that serve as a standard form fitting a broad range of SpTD algorithms. Second, we design a sparse tensor engine (STE) to perform SpTD efficiently. STE uses a processing element (PE)-interactive architecture in which PEs can be flexibly grouped via a Network-on-Chip (NoC) to share buffer capacity, bandwidth, and compute resources. In extensive experiments, our accelerator achieves an average speedup of 45× over CPU and 29× over GPU.
