Fault tolerance in cloud data centers is a critical mechanism for handling the escalating frequency of failures. As the size and complexity of large-scale systems grow, predicting and mitigating failures becomes increasingly difficult, rendering previous solutions inadequate for meeting the high-performance demands of both cloud users and providers. This paper addresses the growing need for improved fault tolerance by developing a computational intelligence scheme for fault-tolerance management, tailored for real-time, on-demand cloud data center environments. In this study, the parameters necessary for designing a fault-tolerant computational intelligence scheme were elicited. A computational system was then designed to determine league winners based on scheduling checkpoints. The designed system was simulated using the Java programming language in the CloudSim simulation toolkit (3.0.3), with a customized Cloud Analyst Graphical User Interface (GUI), on the Eclipse Integrated Development Environment (IDE), Luna release 4.4. The developed system, the Checkpointed League Winner Algorithm (CPLWA), was compared in real time with existing schemes, namely Ant Colony Optimization (ACO), Genetic Algorithm (GA), and League Championship Algorithm (LCA). CPLWA was also evaluated for its resistance to faults and for the percentage improvement it brings to cloud data centers. The parameters used to evaluate the scheme are the Failure to Perform Ratio (FPR), the Failure that causes a Delay in Performance (FDP), and the Rate at which Performance Improves (RPI). The results indicate that, when the overall average over the life of each scheme is considered, CPLWA yields improvements of 38.2%, 29.9%, and 20.5% over ACO, GA, and LCA, respectively.
In terms of average makespan, the Checkpointed League Winner Algorithm shows a significant improvement, outperforming ACO, GA, and LCA by 41%, 33%, and 23%, respectively. In terms of response time, it outperforms ACO, GA, and LCA by 54.3%, 56.6%, and 30.2%, respectively. It also achieves a lower failure ratio than the existing metaheuristic methods (ACO, GA, and LCA). These improvements can be attributed to the iterative structure, the migration, and the checkpointing approaches employed in the scheme. In summary, this study developed a fault-tolerant computational intelligence scheme for an on-demand, real-time cloud data center environment, in which the Checkpointed League Winner Algorithm outperformed the existing schemes in terms of response time, average makespan, and failure ratio. Future work should explore applying the Checkpointed League Winner Algorithm to resource management, provisioning, and virtual machine placement challenges within distributed systems.
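The percentage figures reported above are relative comparisons against each baseline scheme. The abstract does not state the formula used, so the following is only a minimal sketch, assuming the conventional relative-reduction definition for lower-is-better metrics such as makespan and response time. The class name `ImprovementSketch`, the method `percentImprovement`, and the sample values are hypothetical illustrations, not part of the study or its measurements.

```java
public class ImprovementSketch {

    /**
     * Relative improvement, in percent, of a candidate scheme over a baseline
     * for a lower-is-better metric (e.g. makespan or response time).
     * Assumed definition: (baseline - candidate) / baseline * 100.
     */
    public static double percentImprovement(double baseline, double candidate) {
        if (baseline <= 0.0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (baseline - candidate) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical makespans in seconds, chosen only to illustrate the arithmetic.
        double baselineMakespan = 200.0;
        double candidateMakespan = 150.0;
        System.out.printf("Improvement: %.1f%%%n",
                percentImprovement(baselineMakespan, candidateMakespan));
    }
}
```

Under this assumed definition, a positive value means the candidate scheme finished faster than the baseline, and a value of zero means no difference.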