Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark

Peng Liu,Zong-Wei Zhu,Jia-Yu Teng,Hui-Han Zhao,Ya-Feng Liu,Yan-Yan Yang

doi:10.1007/s11771-019-3978-x

Abstract

The sharp increase of the amount of Internet Chinese text data has significantly prolonged the processing time of classification on these data. In order to solve this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, a parallel memory computing platform for big data. This algorithm has implemented parallel operation throughout the entire training and prediction process of naive Bayes classifier mainly by adopting the programming model of resilient distributed datasets (RDD). For comparison, a PNBA based on Hadoop is also implemented. The test results show that in the same computing environment and for the same text sets, the Spark PNBA is obviously superior to the Hadoop PNBA in terms of key indicators such as speedup ratio and scalability. Therefore, Spark-based parallel algorithms can better meet the requirement of large-scale Chinese text data mining.

Full Text