Abstract

Canonical Polyadic Decomposition (CPD) is one of the most popular tensor decomposition methods and plays an important role in big data analysis. For sparse tensors, the major computation in CPD, known as the matricized tensor times Khatri-Rao product (MTTKRP), exhibits discontinuous memory access and becomes the performance bottleneck on emerging processor architectures. In this paper, we propose swCPD, an efficient CPD implementation on the many-core Sunway processor. swCPD accelerates the optimization algorithms whose performance is dominated by MTTKRP, including Alternating Least Squares (ALS), Gradient Descent (GD) and Randomized Block Sampling (RBS), as well as the more recent fast Levenberg–Marquardt (fLM++) and Generalized Canonical Polyadic Decomposition with Stochastic Gradient Descent (GCP-SGD). The main idea adopted in swCPD is a hierarchical partitioning mechanism. From the computation perspective, the 64 Computation Processing Elements (CPEs) in a Sunway processor are divided into eight groups, each containing seven workers and one controller. From the data perspective, we partition the sparse tensor at three granularities: blocks, bands and tiles. Moreover, we develop a communication mechanism based on register communication for cooperation between CPEs. We evaluate swCPD on both synthesized and real-world datasets. The experimental results show that each optimized algorithm in swCPD achieves better performance than the corresponding algorithm in cutting-edge CPD implementations.
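
To make the memory-access issue concrete, the following is a minimal sketch of a mode-0 MTTKRP for a 3-way sparse tensor stored in coordinate (COO) form. It is an illustration only, not the swCPD implementation: the function name mttkrp_mode0, the COO layout, and the small example data are all assumptions introduced here. Each nonzero reads one row of B and one row of C chosen by its own indices, which is the source of the discontinuous access described above.

```python
def mttkrp_mode0(coords, values, B, C, num_rows):
    """Mode-0 MTTKRP for a 3-way sparse tensor in COO form (illustrative sketch).

    coords: list of (i, j, k) index triples, one per nonzero
    values: list of nonzero values, aligned with coords
    B, C:   factor matrices for modes 1 and 2, stored as lists of rows of length R
    Returns M (num_rows x R) where
        M[i][r] = sum over nonzeros (i, j, k) of x_ijk * B[j][r] * C[k][r]
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    for (i, j, k), x in zip(coords, values):
        # Rows are selected by the nonzero's indices, so consecutive
        # nonzeros may touch rows that are far apart in memory.
        b_row, c_row, m_row = B[j], C[k], M[i]
        for r in range(rank):
            m_row[r] += x * b_row[r] * c_row[r]
    return M


# Hypothetical 2 x 3 x 2 tensor with three nonzeros and rank R = 2.
coords = [(0, 0, 1), (1, 2, 0), (0, 1, 1)]
values = [2.0, 3.0, 1.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = [[1.0, 1.0], [2.0, 0.5]]

M = mttkrp_mode0(coords, values, B, C, num_rows=2)
print(M)
```

In this sketch the inner loop over r is contiguous, while the choice of rows in B, C and M is driven entirely by the sparsity pattern. Grouping nonzeros so that nearby ones share rows is the kind of locality that a block, band and tile partitioning would aim to improve.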
