Abstract

Over the last decade, data volumes have grown faster than the available processing power. Distributed computing has been essential for handling these vast amounts of data. However, data may not be distributed uniformly across the nodes of a system, which gives rise to the problems of data skew and performance skew. A key challenge is to estimate the effective performance skew of a set of queries on a multi-node computing cluster from the data skew of the dataset. We use HPCC Systems, a modern big data management and analysis tool. Current methods for measuring the impact of skew on query performance on an HPCC cluster depend heavily on human interpretation. This project aims to automate skew prediction by analyzing the execution graphs of a job on an HPCC Systems cluster and predicting the probable performance skew for a given set of queries using a Random Forest Regressor model.
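
To make the notion of data skew concrete, the following is a minimal sketch of one common way to quantify it: the deviation of the largest and smallest partition from the mean partition size. The function name `partition_skew`, the row counts, and the choice of this particular measure are assumptions made for illustration; the abstract does not state which skew definition or features the project uses.

```python
from statistics import mean


def partition_skew(rows_per_node):
    """Return (max_skew, min_skew) as fractions of the mean partition size.

    Assumed definition (not taken from the paper):
        max_skew = (largest partition - mean) / mean
        min_skew = (smallest partition - mean) / mean
    """
    if not rows_per_node:
        raise ValueError("need at least one partition")
    avg = mean(rows_per_node)
    if avg == 0:
        # All partitions are empty, so there is nothing to be skewed.
        return 0.0, 0.0
    max_skew = (max(rows_per_node) - avg) / avg
    min_skew = (min(rows_per_node) - avg) / avg
    return max_skew, min_skew


# Hypothetical row counts on a four-node cluster.
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

print(partition_skew(balanced))  # (0.0, 0.0)
print(partition_skew(skewed))    # (1.8, -0.6)
```

In the second case one node holds 180% more rows than the average, so a job whose runtime is bounded by its slowest node would be expected to show performance skew as well. Values of this kind are the sort of input a regression model could plausibly be trained on.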
