A growing number of computation-intensive scientific applications with diverse needs are migrating to cloud computing systems. However, as user demand increases, a cloud system alone cannot always meet applications’ requirements. Multi-cloud systems, which can provide scalable storage and computing resources, are therefore a good solution. The main challenges for such systems are multiple billing mechanisms, heterogeneity of virtual resources, and system reliability. In response to these challenges, we first build a fault-tolerant workflow scheduling framework for multi-cloud systems, which aims to improve the execution reliability of scientific applications and reduce their execution cost. Second, we use the Weibull distribution to analyze task execution reliability and hazard rate; tasks with a high execution hazard rate are then duplicated. Third, we integrate the billing mechanisms of different cloud providers into the proposed scheduling framework and formulate the workflow scheduling problem mathematically as an optimization problem. Fourth, we define the cost-efficient bottom level of DAG tasks and propose a fault-tolerant cost-efficient workflow scheduling algorithm (FCWS) that minimizes application execution cost and time while ensuring reliability. Simulation experiments for performance evaluation were conducted on two real-world applications: Epigenomics and LIGO. The results demonstrate that the proposed FCWS algorithm outperforms the existing FR-MOS and CWS algorithms in terms of cost and reliability; in terms of makespan, FCWS is better than CWS but inferior to FR-MOS.
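The Weibull-based duplication step can be illustrated with a minimal sketch. The abstract does not give the parameterization, so the code assumes the standard two-parameter Weibull form, with reliability R(t) = exp(-(t/scale)^shape) and hazard rate h(t) = (shape/scale)(t/scale)^(shape-1). The function names, the choice to evaluate the hazard at a task's execution time, and the fixed `hazard_threshold` rule are hypothetical illustrations, not the FCWS algorithm itself.

```python
import math


def weibull_reliability(t, scale, shape):
    """Probability of surviving to time t under a two-parameter Weibull model."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate at time t (t > 0) under the same model."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def tasks_to_duplicate(exec_times, scale, shape, hazard_threshold):
    """Return names of tasks whose hazard rate exceeds the threshold.

    Assumption: the hazard is evaluated at each task's execution time and
    compared against a single fixed threshold (hypothetical rule).
    """
    return [
        name
        for name, t in exec_times.items()
        if weibull_hazard(t, scale, shape) > hazard_threshold
    ]


if __name__ == "__main__":
    # Hypothetical execution times (in seconds) for three workflow tasks.
    exec_times = {"t1": 10.0, "t2": 50.0, "t3": 90.0}
    for name, t in exec_times.items():
        r = weibull_reliability(t, scale=100.0, shape=2.0)
        h = weibull_hazard(t, scale=100.0, shape=2.0)
        print(f"{name}: reliability={r:.4f}, hazard={h:.4f}")
    print(tasks_to_duplicate(exec_times, 100.0, 2.0, hazard_threshold=0.009))
```

With a shape parameter above 1 the hazard rate grows with time, so under this assumed rule longer-running tasks are the ones selected for duplication.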