Abstract

High-performance computing clusters are widely used in large-scale data mining applications, which are computationally intensive and place high demands on persistence, stability and real-time operation. To support large-scale data processing, we design a multi-factor real-time monitoring fault tolerance (MRMFT) model based on a GPU cluster. However, the higher clock frequency of GPU chips results in excessively high energy consumption in computing systems. Moreover, owing to individual differences between chips, the ability to sustain long periods of high-temperature operation varies greatly from one GPU to another. In this paper, we design a GPU cluster energy consumption monitoring system based on wireless sensor networks (WSNs) and propose an energy-consumption-aware checkpointing (ECAC) scheme to address the high energy consumption problem. ECAC has two advantages: it sets checkpoints according to the actual energy consumption and the device temperature, which improves checkpoint utilization and reduces time cost; and it exploits the parallelism between CPU and GPU to hide the CPU detection overhead behind GPU parallel computation, which further reduces the time and energy overhead of the fault tolerance phase. Using ECAC as the constraint and aiming for persistent and reliable operation, we design a dynamic task migration mechanism that greatly improves the reliability of the cluster. Theoretical analysis and experimental results show that the model improves the persistence and stability of the computing system while reducing checkpoint overhead.
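The idea of placing checkpoints according to measured energy consumption and device temperature, rather than at fixed time intervals, can be illustrated with a minimal Python sketch. This is not the ECAC algorithm itself, which the abstract does not specify; the `Reading` schema, the `CheckpointPolicy` thresholds and the `plan_checkpoints` function are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sensor sample for a GPU node (hypothetical schema)."""
    power_watts: float    # instantaneous power draw
    temperature_c: float  # device temperature
    dt_seconds: float     # time elapsed since the previous sample


@dataclass
class CheckpointPolicy:
    """Illustrative thresholds; the values are assumptions, not from the paper."""
    energy_budget_joules: float = 50_000.0
    temperature_limit_c: float = 85.0


def plan_checkpoints(readings, policy):
    """Return the indices of the samples after which a checkpoint is taken.

    A checkpoint is triggered when the energy consumed since the last
    checkpoint reaches the budget, or when the device temperature reaches
    the limit, whichever happens first.
    """
    checkpoints = []
    energy_since_last = 0.0
    for i, r in enumerate(readings):
        # Approximate energy over the sample interval as power * time.
        energy_since_last += r.power_watts * r.dt_seconds
        too_hot = r.temperature_c >= policy.temperature_limit_c
        over_budget = energy_since_last >= policy.energy_budget_joules
        if too_hot or over_budget:
            checkpoints.append(i)
            energy_since_last = 0.0  # start a new accounting window
    return checkpoints
```

Under such a policy a cool, lightly loaded node is checkpointed rarely, while a node that runs hot or draws a lot of power is checkpointed more often, which is one plausible reading of how checkpoint utilization could be improved.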
