Abstract

Label-Aware Distributed Ensemble Learning (LADEL) is a programming model, and an associated implementation, for distributing the training of any classifier to handle Big Data. It only requires users to specify the training data source, the classification algorithm, and the desired parallelization level. First, a distributed stratified sampling algorithm is proposed to generate stratified samples from large, pre-partitioned datasets in a shared-nothing architecture. It executes in a single pass over the data and minimizes inter-machine communication. Second, training of the specified classification algorithm is parallelized and executed on any number of heterogeneous machines. Finally, the trained classifiers are aggregated to produce the final classifier. Data miners can use LADEL to run any classification algorithm on any distributed framework without any experience in parallel and distributed systems. The LADEL model can be implemented on any distributed framework (Drill, Spark, Hadoop, etc.) to speed up the development of its data mining capabilities. It is also generic: it can distribute the training of any classification algorithm from any sequential, single-node data mining library (Weka, R, scikit-learn, etc.). Distributed frameworks can implement LADEL to distribute the execution of existing data mining libraries without rewriting the algorithms to run in parallel. As a proof of concept, the LADEL model is implemented on Apache Drill to distribute the training of Weka's classification algorithms. Our empirical studies show that LADEL classifiers achieve accuracy similar to, and sometimes better than, single-node classifiers, with significantly faster training and scoring times.
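To make the three-step workflow above concrete, the following is a minimal single-machine sketch in Python, with scikit-learn standing in for Weka and a local process pool standing in for the distributed framework. Every name in it (ladel_fit, ladel_predict, stratified_sample, train_member) is hypothetical and illustrative only, not part of the paper's actual Drill/Weka implementation.

    # Minimal single-machine sketch of a LADEL-style pipeline (hypothetical API).
    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def stratified_sample(X, y, fraction, seed):
        # Step 1: draw a stratified sample so label proportions are preserved.
        X_s, _, y_s, _ = train_test_split(
            X, y, train_size=fraction, stratify=y, random_state=seed)
        return X_s, y_s

    def train_member(args):
        # Step 2: train one ensemble member on its own stratified sample.
        X, y, base, fraction, seed = args
        X_s, y_s = stratified_sample(X, y, fraction, seed)
        return base(random_state=seed).fit(X_s, y_s)

    def ladel_fit(X, y, base=DecisionTreeClassifier, parallelism=4, fraction=0.25):
        # Train `parallelism` classifiers concurrently, one per stratified sample.
        jobs = [(X, y, base, fraction, seed) for seed in range(parallelism)]
        with ProcessPoolExecutor(max_workers=parallelism) as pool:
            return list(pool.map(train_member, jobs))

    def ladel_predict(members, X):
        # Step 3: aggregate the members into a final classifier by majority vote
        # (one plausible aggregation strategy; the abstract does not name the
        # one LADEL uses). Labels are assumed to be integer-coded.
        votes = np.stack([m.predict(X) for m in members])
        return np.array([np.bincount(col).argmax() for col in votes.T])

A call such as ladel_fit(X, y, base=DecisionTreeClassifier, parallelism=8, fraction=0.1) then mirrors the three user-supplied inputs named in the abstract: the training data, the classification algorithm, and the parallelization level. On platforms that spawn worker processes, the call must sit under an if __name__ == "__main__": guard.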
