Abstract
Modern data analytics platforms are often coupled with external data storage services such as Amazon S3, resulting in storage bottlenecks. Existing caching and prefetching solutions use higher-level information from data analytics frameworks, such as job dependency graphs (e.g., DAGs) and historical run time information, to predict future data accesses, and then prefetch data into the cache and manage the cache contents based on those predictions. However, in doing so, they miss a fundamental opportunity: rather than caching data given a prediction of job execution, we can actually influence the job execution order to enable more effective caching and prefetching. With this key insight, we devise a set of novel heuristics and design a system, Tripod, which harmonizes job scheduling and data caching for analytics frameworks. Using the higher-level information from analytics frameworks, Tripod searches, guided by the devised heuristics, for the job execution order best suited to prefetching and caching. We have implemented Tripod as extensions to Apache YARN and Tez. Our evaluation using standard analytics benchmarks (TPC-H and TPC-DS) shows that Tripod achieves up to 1.7x speedup over state-of-the-art approaches.
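To make the key insight concrete, the following is a minimal sketch of how execution order alone can change cache effectiveness. It is not Tripod's algorithm: the abstract does not specify the heuristics, so the greedy overlap rule, the LRU cache model, the function names (`count_misses`, `overlap_order`), and the four-job workload are all assumptions chosen for illustration. Jobs are assumed to be independent, so any order is legal.

```python
from collections import OrderedDict

def count_misses(order, inputs, capacity):
    """Count remote-storage fetches under an LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for job in order:
        for block in inputs[job]:
            if block in cache:
                cache.move_to_end(block)       # hit: mark as most recently used
            else:
                misses += 1                    # miss: fetch from remote storage
                cache[block] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used block
    return misses

def overlap_order(jobs, inputs):
    """Hypothetical heuristic: run next the job sharing the most inputs with the previous one."""
    remaining = list(jobs)
    order = [remaining.pop(0)]
    while remaining:
        prev = set(inputs[order[-1]])
        best = max(remaining, key=lambda j: len(prev & set(inputs[j])))
        remaining.remove(best)
        order.append(best)
    return order

# Illustrative workload (assumed): two pairs of jobs that read the same tables.
inputs = {
    "q1": ["lineitem", "orders"],
    "q2": ["part", "supplier"],
    "q3": ["lineitem", "orders"],
    "q4": ["part", "supplier"],
}
submitted = ["q1", "q2", "q3", "q4"]
reordered = overlap_order(submitted, inputs)

print(reordered)                                       # ['q1', 'q3', 'q2', 'q4']
print(count_misses(submitted, inputs, capacity=2))     # 8
print(count_misses(reordered, inputs, capacity=2))     # 4
```

With a two-block cache, the submitted order evicts each table just before it is needed again, while the reordered schedule reuses cached inputs and halves the number of remote fetches. A real scheduler would also have to respect DAG dependencies and resource constraints, which this sketch ignores.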