Cloud IaaS platforms readily provide access to homogeneous multi-core machines, whether physical ("bare metal") or virtual. Each machine can be equipped with high-performance SSDs, so workflow-generated files can be distributed across multiple machines, which helps minimize the overhead of data transfers. In this paper, we propose a scheduling algorithm called SMDT-ERU (Scheduling for Minimizing Data Transfer - Enhancing Resource Utilization), designed to reduce the makespan of data-intensive workflows by minimizing network data transfers between dependent tasks. Intermediate files generated by a task are stored locally on the disk of the machine where the task executes. Through experimentation, we confirm that increasing the number of cores per machine reduces the additional cost caused by network data transfers. Experiments on real-world workflows demonstrate the advantages of the proposed algorithm. Our data-driven scheduling approach significantly reduces both execution time and the volume of data transferred over the network, outperforming one of the leading state-of-the-art algorithms, which we adapted to fit our assumptions.