Abstract

Hadoop is a successful open-source implementation of the MapReduce programming model. It has been widely adopted by leading companies for big data analytics. However, its intermediate data shuffling is a time-consuming operation that adds to the total execution time of MapReduce programs. Recently, a growing number of organizations have become interested in addressing this issue by leveraging high-performance interconnects, such as InfiniBand and 10 Gigabit Ethernet, which are popular in the High-Performance Computing (HPC) community. Yet a comprehensive examination of the performance impact of these interconnects on MapReduce programs is still lacking. In this work, we systematically evaluate the performance impact of two popular high-speed interconnects, 10 Gigabit Ethernet and InfiniBand, using the original Apache Hadoop and our extended Hadoop Acceleration framework. Our analysis shows that, under Apache Hadoop, fast networks can effectively accelerate jobs with small intermediate data sizes but cannot maintain that advantage for jobs with large intermediate data. In contrast, Hadoop Acceleration provides better performance for jobs across a wide range of data sizes. In addition, both implementations exhibit good scalability under different networks. Hadoop Acceleration also significantly reduces the CPU utilization and I/O wait time of MapReduce programs.
