Abstract

Extracting information from large-scale high-dimensional data is a fundamentally important task in high performance computing, where the hierarchical Tucker (HT) tensor learning approach (learning a tensor-tree structure) has been widely used in many applications. However, HT tensor learning algorithms are compute-intensive due to the “<i>curse of dimensionality</i>,” i.e., the time complexity grows exponentially with the order of the data tensor. The computation of HT tensor learning algorithms boils down to tensor primitives, which are amenable to computation on GPU tensor cores, yet existing work does not support HT tensor learning on GPU tensor cores. Three main challenges must be addressed: 1) accelerating tensor learning primitives using GPU tensor cores; 2) implementing the tensor learning algorithms on GPU tensor cores and multiple GPUs; and 3) supporting large-scale data tensors that exceed GPU memory capacity. In this paper, we present efficient HT tensor learning primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize HT tensor learning primitives, including tensor contractions, tensor matricizations, and tensor singular value decomposition (SVD), and employ these optimized primitives to accelerate HT tensor decomposition algorithms for Big Data analysis. Second, we propose a novel HT tensor layer for deep neural networks whose training involves only a forward pass, without backpropagation; the forward pass consists of tensor operations, further exploiting the computing power of GPU tensor cores. Third, we apply the optimized primitives to develop a tensor-tree-structured quantum machine learning algorithm, the <i>tree-tensor network (TTN)</i>.
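The two primitives named above besides contraction, matricization and truncated SVD, can be sketched on the CPU as follows. This is a minimal NumPy illustration of what the primitives compute, not the paper's tensor-core implementation; the function names and the toy tensor shape are our own for illustration.

```python
import numpy as np

def matricize(tensor, mode):
    """Mode-`mode` matricization: move axis `mode` to the front and
    flatten all remaining axes into the columns of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def truncated_svd(matrix, rank):
    """Rank-truncated SVD, the building block used to compute the
    factor at each node of the HT tensor tree."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]

# Toy third-order tensor: matricize along mode 1, then truncate to rank 3.
x = np.random.default_rng(0).normal(size=(4, 5, 6))
m1 = matricize(x, 1)                      # shape (5, 24)
u, s, vt = truncated_svd(m1, rank=3)      # u has shape (5, 3)
```

An HT decomposition applies this pattern recursively: matricize with respect to each node's mode set, truncate via SVD, and contract the resulting factors down the tree. The GPU-tensor-core versions replace these dense NumPy kernels with mixed-precision tiled equivalents.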
Compared with TensorLy and TensorNetwork on NVIDIA A100 GPUs, our third-order HT tensor decomposition algorithm achieves up to <inline-formula><tex-math notation="LaTeX">$8.92 \times$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">$6.42 \times$</tex-math></inline-formula> speedups, respectively, and our high-order algorithm achieves up to <inline-formula><tex-math notation="LaTeX">$32.67 \times$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">$23.97 \times$</tex-math></inline-formula> speedups, respectively. Our HT tensor layer for a fully connected neural network achieves <inline-formula><tex-math notation="LaTeX">$49.2 \times$</tex-math></inline-formula> compression at the cost of a 0.5% drop in accuracy, with a <inline-formula><tex-math notation="LaTeX">$1.42 \times$</tex-math></inline-formula> speedup over the CUDA-core implementation; for AlexNet, our HT tensor layer achieves <inline-formula><tex-math notation="LaTeX">$9.45 \times$</tex-math></inline-formula> compression at the cost of a 0.8% drop in accuracy, with a <inline-formula><tex-math notation="LaTeX">$1.87 \times$</tex-math></inline-formula> speedup over the CUDA-core implementation. Our TTN algorithm achieves up to <inline-formula><tex-math notation="LaTeX">$11.17\times$</tex-math></inline-formula> speedup compared with TensorNetwork, demonstrating the potential of optimized tensor learning primitives for the classical simulation of quantum machine learning algorithms.
