Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising class of systems for manipulating large matrices. With that motivation in mind, we present a high-performance system exploiting a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm that first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytics systems, our system is based on array operators programmed in C++, working directly on the Unix file system instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude improvements in processing time. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
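
To make the two-phase idea concrete, the following is a minimal sketch in R (the mathematical language the abstract cites for phase two). The specific summary layout used here (count n, linear sums L, quadratic sums Q, and cross-products with the output) and all variable names are assumptions for illustration only; in the actual system, phase one runs as parallel C++ array operators inside the array DBMS, and only the summary, not the data set, is passed to R on one node.

# Phase 1 (simulated locally here; in the system this runs inside the parallel DBMS):
# compute a compact summary of the n x d data matrix X and output vector y.
set.seed(1)
d <- 3; n <- 1000
X <- matrix(rnorm(n * d), nrow = n)            # n x d input data matrix
y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1)  # synthetic output

nS  <- n                 # count
L   <- colSums(X)        # d-vector of column sums
Q   <- crossprod(X)      # d x d matrix X^T X
XTy <- crossprod(X, y)   # d-vector X^T y

# Phase 2 (in R, on one node): evaluate matrix equations on the reduced,
# d x d summary matrices, never touching the n x d data set again.
beta <- solve(Q, XTy)                   # linear regression coefficients (no intercept, for brevity)
V    <- Q / nS - (L / nS) %*% t(L / nS) # covariance matrix from the summary
pca  <- eigen(V)                        # PCA: principal components in pca$vectors

The point of the sketch is that the intermediate matrices in phase two are d x d regardless of n, which is why the model computation fits in main memory on one node once the summary has been computed in parallel.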
