Abstract

Recent years have witnessed a steep rise in data generation and, consequently, the widespread adoption of software solutions that support data-intensive applications. Many companies currently engage in data-intensive processes; however, fully embracing a data-driven paradigm is still cumbersome, and establishing a production-ready, fine-tuned deployment is time-consuming. This situation calls for innovative models and techniques to streamline deployment configuration for Big Data applications. Moreover, many companies use Cloud-deployed clusters, which represent a cost-effective alternative to on-premises installation. Accurate and fast prediction of the execution time of a Big Data application helps improve design-time decisions, reduces Cloud over-allocation charges, and assists budget management. In this paper, we propose analytical models based on Stochastic Activity Networks (SANs) to model the execution of Hadoop MapReduce (MR), Tez, and Spark applications, i.e., the frameworks most commonly used for Big Data analyses. The proposed SANs model these applications together with the underlying cluster in order to accurately estimate the execution time. We evaluate the accuracy of the proposed models on the TPC-DS industry benchmark across different configurations. Results obtained by numerically solving the SAN models show an average error of 4.5%, 5.8%, and 2.7% in estimating the execution time of MR, Tez, and Spark applications, respectively, against the data gathered from the experiments, demonstrating higher accuracy than the state of the art. Moreover, the time required to solve the proposed models is lower than the simulation time of previously presented approaches in this area.
