Abstract

Big data analytics generally rely on parallel processing in large computer clusters. However, this approach is not always the best. CPU speeds and RAM capacities keep growing, making small computers faster and more attractive to the analyst. Machine Learning (ML) models are generally computed on a data set obtained by aggregating, transforming, and filtering big data, which is orders of magnitude smaller than the raw data. Users prefer “easy” high-level languages like R and Python, which accomplish complex analytic tasks with a few lines of code, but these languages have memory and speed limitations. Finally, data summarization has been a fundamental technique in data mining, and it holds great promise for big data. With that motivation in mind, we adapt the \(\varGamma \) (Gamma) summarization matrix, previously used in parallel DBMSs, to work in the R language. \(\varGamma \) is significantly smaller than the data set, yet it captures fundamental statistical properties. \(\varGamma \) works well for a remarkably wide spectrum of ML models, both supervised and unsupervised, assuming dimensions (variables) are either dependent or independent. An extensive experimental evaluation shows that models computed on summarized data sets are accurate and that their computation is significantly faster than with R built-in functions. Moreover, experiments show our R solution is faster and less resource-hungry than competing parallel systems, including a parallel DBMS and Spark.
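To make the idea concrete, the following is a minimal R sketch of a \(\varGamma \)-style summarization matrix. All variable names and the synthetic data are illustrative assumptions, not the paper's actual code: the key point is that \(\varGamma = Z^{T}Z\) for the augmented matrix \(Z = [1, X, y]\) holds \(n\), the column sums, and all cross-products, which suffice to solve, e.g., linear regression without touching the raw data again.

```r
# Hedged sketch (synthetic data; names are illustrative, not the paper's code).
set.seed(1)
n <- 1000; d <- 3
X <- matrix(rnorm(n * d), n, d)
y <- X %*% c(2, -1, 0.5) + rnorm(n, sd = 0.1)

# Augment the data set: Z = [1, X, y].
# Gamma = t(Z) %*% Z is a small (d+2) x (d+2) matrix holding n,
# column sums, and all cross-products -- sufficient statistics.
Z <- cbind(1, X, y)
Gamma <- t(Z) %*% Z

# Linear regression from Gamma alone: beta = (X'X)^{-1} X'y,
# where X'X and X'y are sub-blocks of Gamma (intercept included).
p <- d + 1                      # intercept column plus d predictors
XtX <- Gamma[1:p, 1:p]
Xty <- Gamma[1:p, p + 1]
beta <- solve(XtX, Xty)

# Sanity check against R's built-in lm(): coefficients should agree.
beta_lm <- unname(coef(lm(y ~ X)))
```

Note the design point this illustrates: once \(\varGamma \) is computed in one pass over the data, model fitting operates only on this tiny matrix, which is why the approach stays fast on a single small machine.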
