Abstract

Latent Dirichlet Allocation (LDA) is a popular topic model. Because the input corpus of an LDA algorithm typically consists of millions to billions of tokens, LDA training is very time-consuming, which prevents the adoption of LDA in many scenarios, e.g., online services. GPUs benefit modern machine learning algorithms and big data analysis by providing high memory bandwidth and tremendous computational power. Many frameworks, e.g., TensorFlow, Caffe, and CNTK, therefore support GPUs to accelerate various data-intensive machine learning algorithms. However, we observe that the performance of existing LDA solutions on GPUs is not satisfactory. In this paper, we present CuLDA, an efficient and scalable GPU-based approach to accelerating large-scale LDA problems. CuLDA is designed to solve LDA problems at high throughput. To this end, we first carefully design the workload partitioning and synchronization mechanisms to exploit multiple GPUs. We then offload the LDA sampling process to each individual GPU, optimizing it from the perspectives of the sampling algorithm, parallelization, and data compression. Experimental evaluation shows that CuLDA outperforms state-of-the-art LDA solutions by a large margin (up to 7.3X) on a single GPU, and achieves an extra 7.5X speedup on 8 GPUs for large data sets.
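For readers unfamiliar with the sampling process that CuLDA accelerates, the following is a minimal CPU-side sketch of per-token collapsed Gibbs sampling for LDA, the standard training loop that GPU-based LDA systems parallelize. It is an illustrative baseline only, not CuLDA's implementation; the function name, hyperparameters (alpha, beta), and count arrays (n_dk, n_kw, n_k) follow conventional LDA notation and are assumptions, not identifiers from the paper.

    # Minimal sketch of collapsed Gibbs sampling for LDA (illustrative only,
    # not CuLDA's GPU implementation).
    import numpy as np

    def gibbs_sample(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
        """docs: list of documents, each a list of word ids in [0, V)."""
        rng = np.random.default_rng(seed)
        D = len(docs)
        n_dk = np.zeros((D, K))   # document-topic counts
        n_kw = np.zeros((K, V))   # topic-word counts
        n_k = np.zeros(K)         # per-topic token totals
        z = []                    # topic assignment per token

        # Random initialization of topic assignments.
        for d, doc in enumerate(docs):
            zd = rng.integers(K, size=len(doc))
            z.append(zd)
            for w, k in zip(doc, zd):
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    # Remove the token's current assignment from the counts.
                    n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                    # Full conditional p(z = k | rest), up to a constant.
                    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                    k = rng.choice(K, p=p / p.sum())
                    # Record the new topic and restore the counts.
                    z[d][i] = k
                    n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        return n_dk, n_kw

On a toy corpus such as docs = [[0, 1, 2], [2, 3]] with V = 4 and K = 2, the returned matrices hold the document-topic and topic-word statistics. The per-token inner loop is the hot spot at the corpus scales cited in the abstract, which is why CuLDA's workload partitioning, sampling, and compression optimizations target it.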
