Scalable aggregation predictive analytics

Christos Anagnostopoulos, Fotis Savva, Peter Triantafillou

doi:10.1007/s10489-017-1093-y


Open Access

https://doi.org/10.1007/s10489-017-1093-y


Abstract

We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology uses historical queries and their answers to accurately predict the answers of ad-hoc queries. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high-quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability, and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark’s COUNT method.
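To make the query-driven idea concrete, the following minimal sketch predicts the COUNT of an unseen query from past query-answer pairs alone, without touching the underlying data. It is not the model described above: it swaps the local linear regression and incremental learning for a plain inverse-distance-weighted average over the nearest past queries. The query encoding (center coordinates plus a radius), the Euclidean similarity measure, the function names, and the history values are all assumptions made for illustration.

```python
import math

def query_distance(q1, q2):
    """Euclidean distance between two query vectors (an assumed similarity measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))

def predict_count(history, new_query, k=3):
    """Predict COUNT for new_query from past (query_vector, count) pairs.

    Uses an inverse-distance-weighted average over the k nearest past queries.
    """
    if not history:
        raise ValueError("history must contain at least one past query")
    # Keep the k past queries most similar to the new one.
    nearest = sorted(history, key=lambda pair: query_distance(pair[0], new_query))[:k]
    weighted = []
    for past_query, count in nearest:
        d = query_distance(past_query, new_query)
        if d == 0.0:
            return float(count)  # identical query already answered
        weighted.append((1.0 / d, count))
    total_weight = sum(w for w, _ in weighted)
    return sum(w * c for w, c in weighted) / total_weight

# Hypothetical history: each query is (center_x, center_y, radius) with its observed COUNT.
history = [
    ((0.2, 0.2, 0.1), 120),
    ((0.8, 0.8, 0.1), 40),
    ((0.5, 0.5, 0.2), 300),
]
estimate = predict_count(history, (0.25, 0.2, 0.1), k=2)
```

With this history, the estimate is dominated by the closest past query (COUNT 120) and pulled slightly toward the second-closest one (COUNT 300). A local regression fitted per neighbourhood of similar queries, as the abstract describes, would replace the weighted average in a fuller treatment.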

Highlights

Recent R&D efforts in the modern big data era have been dominated by efforts to accommodate distributed big datasets with frameworks that enable high-quality and scalable distributed/parallel data analyses
The reason we focus on the COUNT aggregation operator is that the answer Set Cardinality Prediction (SCP) of a multidimensional range query is a fundamental task, playing a
– We implement our Machine Learning (ML) model within the Spark system
– We provide comprehensive experiments showing the quality of prediction of our ML model through a variety of evaluation metrics
– We experiment with the scalability performance of our ML model compared with Spark’s COUNT method for answer-set cardinality estimation

Summary

Introduction

Recent R&D efforts in the modern big data era have been dominated by efforts to accommodate distributed big datasets with frameworks that enable high-quality and scalable distributed/parallel data analyses. Predictive modeling [23, 26] and exploratory analysis [2, 3, 6, 20] are commonly based on statistical aggregation operators over the results of exploration queries [4, 7]. Such queries involve large datasets (which may themselves be the result of linking other datasets) and a number of range predicates over multidimensional (vectorial) representations of structured, semi-structured, and unstructured data. Imagine exploratory and predictive analytics [9] based on a stream of such aggregation operators issued over data subspaces until the scientists/analysts extract sufficient statistics or fit local function estimators, e.g., the coefficient of determination, the product-moment correlation coefficient, and multivariate local linear approximation over the subspaces of interest.
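As an illustration of the kind of query discussed here, the sketch below evaluates a COUNT aggregation with one range predicate per dimension over a small in-memory dataset. It shows the exact, data-centric computation that requires access to the data. The function name, the closed-interval bounds, and the sample points are assumptions for illustration only.

```python
def count_in_range(points, lower, upper):
    """Exact COUNT of points p with lower[i] <= p[i] <= upper[i] in every dimension."""
    return sum(
        1
        for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))
    )

# Hypothetical 2-dimensional dataset and a range query over the subspace
# [0.0, 0.5] x [0.0, 0.6].
points = [(0.1, 0.2), (0.4, 0.5), (0.45, 0.9), (0.8, 0.3)]
answer = count_in_range(points, lower=(0.0, 0.0), upper=(0.5, 0.6))
```

Here `answer` is 2, since only the first two points satisfy both range predicates. The cost of this exact evaluation grows with the number of data points scanned, and it presupposes direct access to the data.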

Objectives

Methods

Results

Conclusion