Parallelization implementation of Bayesian algorithms based on Spark platform

Hong Liu, Wei Hu, Hong Guo

doi:10.1109/smartcloud.2019.00042

Abstract

With the rapid development of Internet technology, data of all kinds are growing exponentially, and how to manage and use these data effectively has become a focus of research in the era of big data. Traditional single-machine serial processing cannot meet the time requirements of massive data processing. To address this, this paper adopts the Spark computing framework, studies the Bayesian algorithm used in data mining, and implements and optimizes a parallel version of it. Because Spark computes in memory, iterative computation is efficient. The computational performance of the parallel program is investigated by comparing Spark parallel computing with traditional single-machine serial execution, and the experiments show that the parallel algorithm effectively speeds up text classification. As the cluster size grows, classification accuracy, running time, and speedup ratio all improve. The parallel Bayesian algorithm on the Spark platform is therefore feasible: it overcomes the inability of a single traditional computer to handle large-scale data and can be applied effectively to a wide range of classification problems.
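To make the idea of a parallel Bayesian classifier concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multinomial naive Bayes model with Laplace smoothing, which the abstract does not specify, and it simulates Spark partitions with plain Python lists so that it runs without a cluster. The function names (`count_partition`, `merge_counts`, `train`, `predict`) and the toy documents are hypothetical. The point it illustrates is that training reduces to counting, and counts from separate partitions can be computed independently and then merged, which is the property a Spark map and reduce step would exploit.

```python
import math
from collections import Counter, defaultdict
from functools import reduce


def count_partition(docs):
    """Map step: count documents per class and words per class in one partition."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for label, tokens in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
    return class_docs, word_counts


def merge_counts(a, b):
    """Reduce step: add together the counts produced by two partitions."""
    class_docs = a[0] + b[0]
    word_counts = defaultdict(Counter)
    for part in (a[1], b[1]):
        for label, counts in part.items():
            word_counts[label].update(counts)
    return class_docs, word_counts


def train(partitions, alpha=1.0):
    """Build log-priors and smoothed log-likelihoods from the merged counts."""
    class_docs, word_counts = reduce(merge_counts, map(count_partition, partitions))
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Laplace smoothing (assumed): add alpha to every word count.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def predict(model, tokens):
    """Return the class with the highest posterior score; unseen words are skipped."""
    log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )

    return max(log_prior, key=score)


# Toy data split into two "partitions" (hypothetical documents).
partitions = [
    [("sports", ["ball", "goal", "team"]), ("tech", ["code", "spark", "cluster"])],
    [("sports", ["team", "match"]), ("tech", ["spark", "data", "code"])],
]
model = train(partitions)
print(predict(model, ["spark", "code"]))  # tech
```

Because the merge step is associative and commutative, the model does not depend on how the documents are split across partitions. On a real cluster the same two steps would typically be expressed as operations on a distributed dataset rather than on Python lists.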

Full Text