Abstract

Computing machines and communication links in heterogeneous distributed computing systems (HDCSs) may fail permanently with nonzero probability, and the results of applications running on these systems (e.g., large-scale parallel image processing and neuroimaging) are expected to deteriorate over time. The reliability and performance of applications on HDCSs therefore remain an important open issue, especially when parallel applications are scheduled on graphics processing unit architectures, and there is an urgent need to maximize performance and reliability while accounting for the impact of communication and machine failures. This work presents a rigorous probabilistic theory that analytically characterizes the performance and reliability of effective task scheduling in the presence of processor and communication failures. An optimal communication path search algorithm that accounts for reliability overhead and a reliability-driven lookahead scheduling algorithm for precedence-constrained tasks are developed. The theoretical model and experimental data, which are based on randomly generated emulation applications represented by directed acyclic graphs, reveal that the proposed algorithms significantly outperform existing scheduling algorithms in terms of expected makespan, reliability, and schedule length ratio. Weaknesses of the algorithms related to the input parameters are also observed.
