Abstract
As one of the most widely recognized models in machine learning, the conditional random field (CRF) has been applied in many domains. Because parameter estimation for CRFs is highly time-consuming, improving CRF performance has received significant attention, particularly in big data environments. To handle large-scale data, CPU-based and GPU-based parallelization schemes have been proposed, but the problem remains open. In this paper, we target the big data environment and propose DHCRF, a distributed CRF for heterogeneous CPU-GPU clusters. Unlike previous work, DHCRF leverages a three-stage heterogeneous Map and Reduce operation to improve performance, making full use of the cluster's CPU-GPU collaborative computing capability. The distributed CRF is further optimized by combining elastic data partitioning with intermediate-result multiplexing: elastic partitioning keeps the load balanced across devices, while intermediate-result multiplexing reduces data communication. Experimental results show that DHCRF delivers notable performance improvements over both the baseline CRF algorithm and a CPU-based parallel CRF algorithm while maintaining competitive correctness.
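To illustrate the load-balancing idea behind elastic data partitioning, the following is a minimal sketch, not the paper's DHCRF implementation. It assumes a hypothetical profiling pass that measures per-device throughput (the `throughput` dictionary and worker names are illustrative), and sizes each worker's share of the training sequences in proportion to that throughput, so faster GPUs receive larger partitions than slower CPUs.

```python
from typing import Dict, List


def elastic_partition(data: List, throughput: Dict[str, float]) -> Dict[str, List]:
    """Split `data` among workers in proportion to measured throughput.

    Illustrative sketch only: assumes `throughput` holds hypothetical
    sequences-per-second rates obtained from a prior profiling pass.
    """
    total = sum(throughput.values())
    shares: Dict[str, List] = {}
    start = 0
    workers = list(throughput.items())
    for i, (worker, rate) in enumerate(workers):
        # The last worker takes the remainder to avoid rounding gaps.
        if i == len(workers) - 1:
            end = len(data)
        else:
            end = start + round(len(data) * rate / total)
        shares[worker] = data[start:end]
        start = end
    return shares


if __name__ == "__main__":
    sequences = list(range(100))  # placeholder for training sequences
    # Hypothetical device throughputs (sequences/sec); names are assumed.
    rates = {"gpu0": 6.0, "gpu1": 6.0, "cpu0": 1.0}
    parts = elastic_partition(sequences, rates)
    for worker, part in parts.items():
        print(worker, len(part))  # gpu0: 46, gpu1: 46, cpu0: 8
```

Under this proportional scheme, all devices finish their map-phase work at roughly the same time, which is the balance property the abstract attributes to elastic partitioning; the actual DHCRF partitioner may use different measurements or granularity.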