Abstract

Extracting information from large-scale high-dimensional data is a fundamentally important task in high performance computing, where the hierarchical Tucker (HT) tensor learning approach (learning a tensor-tree structure) has been widely used in many applications. However, HT tensor learning algorithms are compute-intensive due to the “<i>curse of dimensionality</i>,” i.e., the time complexity grows exponentially with the order of the data tensor. The computation of HT tensor learning algorithms boils down to tensor primitives, which are amenable to computation on GPU tensor cores, yet existing work does not support HT tensor learning on GPU tensor cores. Three main challenges must be addressed: 1) accelerating tensor learning primitives using GPU tensor cores; 2) implementing the tensor learning algorithms on GPU tensor cores and multiple GPUs; and 3) supporting large-scale data tensors that exceed GPU memory capacity. In this paper, we present efficient HT tensor learning primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize HT tensor learning primitives, including tensor contractions, tensor matricizations, and tensor singular value decomposition (SVD), and employ these optimized primitives to accelerate HT tensor decomposition algorithms for Big Data analysis. Second, we propose a novel HT tensor layer for deep neural networks whose training involves only a forward pass, without backpropagation; the forward pass consists of tensor operations, further exploiting the computing power of GPU tensor cores. Third, we apply the optimized primitives to develop a tensor-tree-structured quantum machine learning algorithm, the <i>tree-tensor network (TTN)</i>.
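The two primitives named above besides contraction, matricization and truncated SVD, can be sketched on the CPU as follows. This is a minimal NumPy illustration of what the primitives compute, not the paper's tensor-core implementation; the function names and the toy tensor shape are our own for illustration.

```python
import numpy as np

def matricize(tensor, mode):
    """Mode-`mode` matricization: move axis `mode` to the front and
    flatten all remaining axes into the columns of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def truncated_svd(matrix, rank):
    """Rank-truncated SVD, the building block used to compute the
    factor at each node of the HT tensor tree."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]

# Toy third-order tensor: matricize along mode 1, then truncate to rank 3.
x = np.random.default_rng(0).normal(size=(4, 5, 6))
m1 = matricize(x, 1)                      # shape (5, 24)
u, s, vt = truncated_svd(m1, rank=3)      # u has shape (5, 3)
```

An HT decomposition applies this pattern recursively: matricize with respect to each node's mode set, truncate via SVD, and contract the resulting factors down the tree. The GPU-tensor-core versions replace these dense NumPy kernels with mixed-precision tiled equivalents.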
Compared with TensorLy and TensorNetwork on NVIDIA A100 GPUs, our third-order HT tensor decomposition algorithm achieves up to <inline-formula><tex-math notation="LaTeX">$8.92 \times$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">$6.42 \times$</tex-math></inline-formula> speedups, respectively, and our high-order algorithm achieves up to <inline-formula><tex-math notation="LaTeX">$32.67 \times$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">$23.97 \times$</tex-math></inline-formula> speedups, respectively. Our HT tensor layer for a fully connected neural network achieves <inline-formula><tex-math notation="LaTeX">$49.2 \times$</tex-math></inline-formula> compression at the cost of a 0.5% drop in accuracy, with a <inline-formula><tex-math notation="LaTeX">$1.42 \times$</tex-math></inline-formula> speedup over the CUDA-core implementation; for AlexNet, our HT tensor layer achieves <inline-formula><tex-math notation="LaTeX">$9.45 \times$</tex-math></inline-formula> compression at the cost of a 0.8% drop in accuracy, with a <inline-formula><tex-math notation="LaTeX">$1.87 \times$</tex-math></inline-formula> speedup over the CUDA-core implementation. Our TTN algorithm achieves up to <inline-formula><tex-math notation="LaTeX">$11.17\times$</tex-math></inline-formula> speedup compared with TensorNetwork, demonstrating the potential of optimized tensor learning primitives for the classical simulation of quantum machine learning algorithms.
