An online and generalized non-negativity constrained model for large-scale sparse tensor estimation on multi-GPU

Linlin Zhuo, Kenli Li, Hao Li, Jiwu Peng, Keqin Li

doi:10.1016/j.neucom.2020.02.068

Abstract

Non-negative Tensor Factorization (NTF) models are effective and efficient in extracting useful knowledge from various types of probabilistic distributions with multi-way information. Current NTF models are mostly designed for problems in computer vision, which involve the whole Matricized Tensor Times Khatri-Rao Product (MTTKRP). Meanwhile, the Sparse NTF (SNTF) proposed to solve the problem of sparse Tensor Factorization (TF) can produce large-scale intermediate data. A Single-thread-based SNTF (SSNTF) model was proposed to solve the problems of non-linear computing and memory overhead caused by this large-scale intermediate data. However, SSNTF is not a generalized model. Furthermore, the above methods cannot describe, in an online way, the stream-like data from industrial applications on mainstream processors, e.g., the Graphics Processing Unit (GPU) and multi-GPU systems. To address these two issues, a Generalized SSNTF (GSSNTF) is proposed, which extends SSNTF to the Euclidean distance, the Kullback-Leibler (KL) divergence, and the Itakura-Saito (IS) divergence. GSSNTF involves only the feature elements, instead of the entire factor matrices, during its update process, which avoids the formation of large-scale intermediate matrices while retaining convergence and accuracy guarantees. Furthermore, GSSNTF can merge new data into the already-built, state-of-the-art tree-structured data set for sparse tensors, so that online learning is guaranteed to operate on the correct data format. Finally, a Compute Unified Device Architecture (CUDA) parallelization of GSSNTF is proposed for a single GPU (CUGSSNTF) and for multi-GPU (MCUGSSNTF). CUGSSNTF has linear computing complexity and space requirement, and linear communication overhead on multi-GPU. CUGSSNTF and MCUGSSNTF are implemented on 8 P100 GPUs in this work, and the experimental results on real-world industrial data sets indicate linear scalability and a 40X speedup of CUGSSNTF over state-of-the-art parallelized approaches.
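The idea of updating with feature elements only can be illustrated with a minimal, sequential sketch. It is not the authors' implementation: it assumes a CP-style model, the Euclidean distance case only, a loss evaluated over the stored (nonzero) entries, and a standard multiplicative update rule. The names `predict`, `sparse_ntf_step`, and `rmse` are hypothetical, and plain Python lists stand in for the tree-structured storage and CUDA kernels mentioned in the abstract.

```python
import random


def predict(index, factors):
    """CP model value at one index tuple: sum over r of prod over n of factors[n][i_n][r]."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for n, i in enumerate(index):
            term *= factors[n][i][r]
        total += term
    return total


def sparse_ntf_step(entries, factors, eps=1e-12):
    """One sweep of multiplicative updates over all modes (Euclidean case, assumed rule).

    entries: list of (index_tuple, value) pairs, stored entries only.
    factors: list of N factor matrices (lists of rows) with positive entries.
    Only the feature elements touched by stored entries are read, so no
    dense intermediate matrix is ever formed.
    """
    rank = len(factors[0][0])
    for n in range(len(factors)):
        rows = len(factors[n])
        num = [[0.0] * rank for _ in range(rows)]
        den = [[0.0] * rank for _ in range(rows)]
        for index, value in entries:
            estimate = predict(index, factors)
            for r in range(rank):
                # Product of the other modes' feature elements for component r.
                other = 1.0
                for m, i in enumerate(index):
                    if m != n:
                        other *= factors[m][i][r]
                num[index[n]][r] += value * other
                den[index[n]][r] += estimate * other
        for i in range(rows):
            for r in range(rank):
                if den[i][r] > 0.0:  # rows without stored entries stay unchanged
                    factors[n][i][r] *= num[i][r] / (den[i][r] + eps)
    return factors


def rmse(entries, factors):
    """Root-mean-square error over the stored entries."""
    squared = sum((v - predict(ix, factors)) ** 2 for ix, v in entries)
    return (squared / len(entries)) ** 0.5


if __name__ == "__main__":
    random.seed(0)
    shape, rank = (5, 4, 3), 2
    truth = [[[random.random() for _ in range(rank)] for _ in range(s)] for s in shape]
    entries = []
    for _ in range(40):
        ix = tuple(random.randrange(s) for s in shape)
        entries.append((ix, predict(ix, truth)))
    factors = [[[random.random() + 0.1 for _ in range(rank)] for _ in range(s)] for s in shape]
    print("RMSE before:", rmse(entries, factors))
    for _ in range(30):
        sparse_ntf_step(entries, factors)
    print("RMSE after: ", rmse(entries, factors))
```

Each sweep costs time proportional to the number of stored entries times the rank and the number of modes, and the extra memory is two accumulators per factor matrix, which is in line with the linear complexity claimed above. Because each stored entry contributes independently to the accumulators, the inner loop is the natural candidate for parallelization.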

Full Text