Abstract

Lack of knowledge about the underlying data distribution in large-scale distributed data can be an obstacle when issuing analytics and predictive modelling queries. Analysts often struggle to find analytics or exploration queries that satisfy their needs. In this paper, we study how the results of exploration queries can be predicted in order to avoid executing ‘bad’ (non-informative) queries that waste network, storage, and financial resources, as well as time, in a distributed computing environment. The proposed methodology clusters a training set of exploration queries together with the cardinality of the results they retrieved (the score) and then uses the resulting query-centroid representatives to make predictions. After the training phase, we propose a novel refinement process that increases the reliability of predicting the score of new, unseen queries based on the refined query representatives. Comprehensive experimentation with real datasets shows that predictions are more reliable after the proposed refinement, which increases the reliability of the closest centroid and improves predictability under the right circumstances.
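
The clustering-then-prediction idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each query is already encoded as a numeric vector (here a hypothetical 2-D query centre), uses plain Lloyd's k-means as a stand-in for whatever clustering the authors chose, and leaves out the refinement step, whose details are not given in this abstract. The names `kmeans`, `train`, and `predict_score` and the toy data are illustrative only.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def train(queries, scores, k):
    """Cluster each query vector jointly with its observed score (cardinality)."""
    joint = [tuple(q) + (s,) for q, s in zip(queries, scores)]
    return kmeans(joint, k)


def predict_score(centroids, query):
    """Return the score stored with the centroid whose query part is closest."""
    closest = min(centroids, key=lambda c: math.dist(query, c[:-1]))
    return closest[-1]


# Toy training set (assumed data): queries near (0, 0) returned many rows,
# queries near (10, 10) returned very few.
queries = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
scores = [100, 110, 90, 5, 4, 6]
centroids = train(queries, scores, k=2)

print(predict_score(centroids, (0.5, 0.5)))    # close to the high-score group
print(predict_score(centroids, (10.5, 10.5)))  # close to the low-score group
```

In this sketch a new query is never executed; its score is read off the nearest query representative. Because the score is clustered together with the query coordinates, a real setting would probably need the two to be scaled comparably, which is skipped here.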

Highlights

  • Owing to the importance and relevance of data in distributed computing environments, large-scale data analytics, predictive modelling, and exploration tasks have rightfully found their place in almost all of today’s industries

  • Apart from the frustration of searching for the right query, executing exploration queries can waste the network and storage resources used to transfer and store query results among computing nodes in a distributed computing environment

  • We hypothesize that query-driven mechanisms can determine whether a query is worth executing in a distributed computing environment, based on score prediction and user criteria


Summary

Introduction

Owing to the importance and relevance of data in distributed computing environments, large-scale data analytics, predictive modelling, and exploration tasks have rightfully found their place in almost all of today’s industries. Exploration querying is one way to access distributed data, but in most cases analysts lack knowledge about the underlying data distributions and their impact on the results. Apart from the frustration of searching for the right query, executing such queries can waste the network and storage resources used to transfer and store query results (including processed data or even raw data for analytics tasks) among computing nodes in a distributed computing environment.

