Abstract
Nowadays, enterprises are inclined to deploy data processing and analytical applications from well-equipped mainframes with highly available hardware components to commodity computers. Commodity machines are less reliable than expensive mainframes. Applications deployed on commodity clusters have to deal with failures that occur frequently. Mostly, these applications perform complex client queries with aggregation and join operations. The longer a query executes, the more it suffers from failures. It causes the entire work has to be re-executed. This paper presents a fault tolerant hash join (FTHJ) algorithm for distributed systems implemented in Apache Ignite. The FTHJ achieves fault tolerance by using a data replication mechanism, materializing intermediate computations. To evaluate FTHJ, we implemented the baseline, unreliable hash join algorithm. Experimental results show that FTHJ takes at least 30% less time to recover and complete join operation when a failure occurs during the execution. This paper describes how we reached a compromise between executing recovery tasks for the least amount of time and using additional resources.
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have