Accelerating low rank matrix completion on FPGA

Shijie Zhou,Rajgopal Kannan,Viktor K Prasanna

doi:10.1109/reconfig.2017.8279771

Abstract

Low Rank Matrix Completion (LRMC) is widely used in the analysis of incomplete datasets. In this paper, we propose a novel FPGA-based accelerator to speedup a matrix-factorization-based LRMC algorithm that uses stochastic gradient descent. The accelerator is a multi-pipelined architecture with parallel pipelines processing distinct data from a shared on-chip buffer. We propose two distinct on-chip buffer architectures based on a design-space exploration of the performance tradeoffs offered by two competing design methodologies: memory-efficiency versus concurrent conflict-free accesses. Our first design (i.e., memory-efficient design) organizes the buffer into banks and maximally utilizes available on-chip memory for matrix chunk processing without requiring complex address translation tables for on-chip addressing; however, it could incur bank conflicts when concurrent accesses to the same bank occur. The second design (i.e., bank-conflict-free design) exploits parallel multiport memory access and completely eliminates bank conflicts by duplicating the stored data; however, it has much higher on-chip RAM consumption. Intuitively, design one enables (slower) acceleration of (larger) chunks of the input matrix whereas design two enables (faster) processing of (smaller) matrix chunks but requires more iterations for processing the complete matrix. We propose a simple but efficient partitioning approach for supporting large input matrices that do not fit in the on-chip memory of FPGA. We also develop algorithmic optimizations based on matching to reduce data dependencies for parallel pipeline execution. We implement our designs on a state-of-the-art UltraScale+ FPGA device. We use real-life datasets for the evaluation and compare these two designs by varying the number of pipelines. The data dependency optimization results in at least 21.6 x data dependency reduction and improves the execution time by up to 66.3 x compared with non-optimized baseline designs. The memory-efficient design is also shown to be more scalable than the bank-conflict-free design. Compared with the state-of-the-art multi-core implementation and GPU implementation, the bank-conflict-free design achieves 5.4 x and 5.2 x speedup, respectively; the memory-efficient design achieves 16.7 x and 16.2 x speedup, respectively.

Full Text