Abstract

To execute distributed joins in parallel on compute clusters, systems partition and exchange data records between workers. With large datasets, workers spend a considerable amount of time transferring data over the network. When compute clusters are shared among multiple applications, workers must compete for network bandwidth with other applications. These variances in the available network bandwidth lead to network skew , which causes straggling workers to prolong the join completion time. We describe SquirrelJoin , a distributed join processing technique that uses lazy partitioning to adapt to transient network skew in clusters. Workers maintain in-memory lazy partitions to withhold a subset of records, i.e. not sending them immediately to other workers for processing. Lazy partitions are then assigned dynamically to other workers based on network conditions: each worker takes periodic throughput measurements to estimate its completion time, and lazy partitions are allocated as to minimise the join completion time. We implement SquirrelJoin as part of the Apache Flink distributed dataflow framework and show that, under transient network contention in a shared compute cluster, SquirrelJoin speeds up join completion times by up to 2.9× with only a small, fixed overhead.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.