Abstract

The energy needed to operate high-performance computing systems has risen rapidly in recent years, and energy consumption has attracted a great deal of attention. Moreover, high energy consumption inevitably leads to failures and reduces system reliability. However, there has been considerably less work on the simultaneous management of system performance, reliability, and energy consumption on heterogeneous systems. In this paper, we first build models of precedence-constrained parallel applications and of energy consumption. Second, we derive the relation between reliability and processor frequency and obtain approximate values of its parameters by least-squares curve fitting. Third, we establish a task execution reliability model and formulate the reliability- and energy-aware scheduling problem as a linear programming problem. Lastly, we propose a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm to solve this problem, which achieves a good tradeoff among system performance, reliability, and energy consumption with lower complexity. Our extensive simulation-based performance evaluation clearly demonstrates the tradeoff achieved by the proposed heuristic algorithm.

Highlights

  • For a long time, energy consumption has been ignored in the performance evaluation of large-scale parallel computing systems

  • We compare the performance, energy consumption, and system reliability of our Reliability-Energy Aware Scheduling (REAS) algorithm with those of three existing scheduling algorithms: dynamic-level scheduling (DLS) [6], reliable dynamic-level scheduling (RDLS) [27], and ECS [20]

  • In the past few years, with the rapid development of heterogeneous systems, the high price of energy, system performance, reliability, and various environmental issues have forced the high-performance computing sector to reconsider some of its old practices with the aim of creating more sustainable systems


Summary

Introduction

Energy consumption has long been ignored in the performance evaluation of large-scale parallel computing systems. The number of transistors integrated into today’s Intel Xeon EX processor reaches nearly 2.3 billion, and its power consumption exceeds 130 W [3]. This implies that the reliability of a single processor may worsen, eventually degrading the reliability of the whole heterogeneous system. Even when a single processor’s one-hour reliability is very high, such as 0.999999, the system’s MTTF (Mean Time to Failure) drops to less than 10 hours as the system size approaches 10,000 cores [4].

