Abstract

Scalability has long been a central challenge in Machine Learning research, with improved algorithms and new techniques constantly being proposed to tackle more complex problems. With the advent of Big Data, this challenge has intensified: new large-scale datasets overwhelm the majority of available techniques. The community has turned to Cloud Computing and distributed programming paradigms as the most immediate solution, among which Apache Spark has proven to be the most promising framework. In this paper we focus on the problem of supervised classification, exploring the family of so-called Bayesian Network Classifiers and studying their adaptability to the MapReduce and Apache Spark frameworks. We analyse a range of algorithms and propose distributed versions of them. Our approach is based on a general framework for learning these probabilistic models from large-scale and high-dimensional data, the latter being a problem with less support in the literature. We also present an extensive experimental evaluation of our proposal over a wide set of problems and different elastic configurations of a computing cluster, showing the full extent of the scalability properties of our framework. Additional material and the software to reproduce our experiments can be found on the supplementary website http://simd.albacete.org/supplements/distributed_bncs.html.
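The key property that makes Bayesian Network Classifiers amenable to MapReduce, as the abstract suggests, is that parameter learning reduces to counting sufficient statistics, which is an associative and commutative aggregation. The following is a minimal, hypothetical sketch (not the paper's actual implementation) of that idea for Naive Bayes, the simplest such classifier, using plain Python in place of a Spark driver: each partition emits local counts (the "map" phase) and partial counts are merged pairwise (the "reduce" phase).

```python
from collections import Counter
from functools import reduce

def map_counts(partition):
    """Map phase: emit local sufficient statistics for one data partition.

    Each record is (class_label, attribute_tuple); we count class
    occurrences and (class, attribute_index, value) co-occurrences.
    """
    counts = Counter()
    for label, attrs in partition:
        counts[(label,)] += 1                       # class prior counts
        for i, v in enumerate(attrs):
            counts[(label, i, v)] += 1              # conditional counts
    return counts

def reduce_counts(a, b):
    """Reduce phase: merge two partial count tables.

    Counter addition is associative and commutative, so partitions can
    be combined in any order -- exactly what MapReduce/Spark require.
    """
    merged = Counter(a)
    merged.update(b)
    return merged

# Toy dataset split into two partitions, as a framework like Spark
# would shard an RDD across workers (illustrative data, not from the paper):
partitions = [
    [("spam", ("buy", "now")), ("ham", ("hello", "there"))],
    [("spam", ("buy", "cheap"))],
]
total = reduce(reduce_counts, map(map_counts, partitions))
print(total[("spam",)])           # class count for "spam" -> 2
print(total[("spam", 0, "buy")])  # count of attribute 0 = "buy" given "spam" -> 2
```

From the merged counts, maximum-likelihood (or smoothed) estimates of the class priors and conditional probability tables follow by simple normalisation on the driver; more elaborate structure-learning scores over these statistics follow the same pattern.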
