Scalable and efficient data distribution for distributed computing of all-to-all comparison problems

Yi-Fan Zhang, Yu-Chu Tian, Wayne Kelly, Colin Fidge

doi:10.1016/j.future.2016.08.020

Abstract

All-to-all comparison problems are a class of big data processing problems found in many application domains. To achieve high performance when such problems are computed in a distributed environment, storage usage, data locality and load balancing must all be considered during the data distribution phase. Existing data distribution strategies, such as the one used by Hadoop, are designed for problems with the MapReduce pattern and do not take comparison tasks into account. As a result, a large amount of data must be re-arranged at runtime when the comparison tasks are executed, which significantly degrades overall computing performance. To address this problem, this paper presents a scalable and efficient data distribution strategy designed with comparison tasks in mind. Tailored to the all-to-all comparison pattern, it not only saves storage space and data distribution time but also achieves load balancing and good data locality for all comparison tasks. Experiments are conducted to evaluate the presented approach. They show that about 90% of the ideal performance capacity of the multiple machines can be achieved with this approach.
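To make the three concerns named in the abstract concrete (storage usage, data locality, load balancing), the following is a minimal illustrative sketch. It is not the distribution strategy of the paper. It assumes symmetric comparisons, so n items yield n(n-1)/2 pairwise tasks, and it uses a simple greedy task assignment chosen only for illustration. The node names, the placements and the functions `comparison_tasks` and `assign_tasks` are hypothetical.

```python
from itertools import combinations


def comparison_tasks(n_items):
    """Return all unordered pairs (i, j) with i < j, i.e. n*(n-1)/2 tasks.

    Assumption: comparisons are symmetric, so each pair is compared once.
    """
    return list(combinations(range(n_items), 2))


def assign_tasks(placement, n_items):
    """Greedily assign each pair to the least-loaded node that stores both items.

    placement: dict mapping a node name to the set of item ids stored on it.
    Returns (load, non_local):
      load      -- dict node -> number of tasks assigned to that node
      non_local -- pairs for which no single node stores both items,
                   so data would have to be moved at runtime
    This greedy rule is an illustrative stand-in, not the paper's algorithm.
    """
    load = {node: 0 for node in placement}
    non_local = []
    for i, j in comparison_tasks(n_items):
        candidates = [node for node, items in placement.items()
                      if i in items and j in items]
        if not candidates:
            non_local.append((i, j))
            continue
        # Break ties by node name so the result is deterministic.
        best = min(candidates, key=lambda node: (load[node], node))
        load[best] += 1
    return load, non_local


# Hypothetical placement of 4 items on 3 nodes: 9 stored copies
# instead of the 12 needed to replicate every item on every node.
good = {"A": {0, 1, 2}, "B": {0, 1, 3}, "C": {0, 2, 3}}
print(assign_tasks(good, 4))   # ({'A': 2, 'B': 2, 'C': 2}, [])

# A placement chosen without regard to comparison tasks:
# three of the six pairs have no node that holds both items.
poor = {"A": {0, 1}, "B": {2, 3}, "C": {0, 2}}
print(assign_tasks(poor, 4))   # ({'A': 1, 'B': 1, 'C': 1}, [(0, 3), (1, 2), (1, 3)])
```

In the first placement every pair can be compared on a node that already stores both items and each node receives two tasks, while storing fewer copies than full replication. In the second, half of the pairs would require data movement at runtime, which is the situation the abstract attributes to distribution strategies that ignore comparison tasks.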

Full Text