Algorithms and Scheduling Techniques to Manage Resilience and Power Consumption in Distributed Systems (Dagstuhl Seminar 15281)

Henri Casanova, Yves Robert, Ewa Deelman, Uwe Schwiegelshohn

doi:10.4230/dagrep.5.7.1

Abstract

Large-scale systems face two main challenges: failure management and energy management. Failure management, the goal of which is to achieve resilience, is necessary because a large number of hardware resources implies a large number of failures during the execution of an application. Energy management, the goal of which is to optimize power consumption and to handle thermal issues, is also necessary because of both monetary and environmental constraints: typical applications executed in HPC and/or cloud environments lead to large power consumption and heat dissipation due to intensive computation and communication workloads. The main objective of this Dagstuhl seminar was to gather two communities: (i) system-oriented researchers, who study high-level resource-provisioning policies, pragmatic resource allocation and scheduling heuristics, novel approaches for designing and deploying systems software infrastructures, and tools for monitoring and measuring the state of the system; and (ii) algorithm-oriented researchers, who investigate formal models and algorithmic solutions for resilience and energy-efficiency problems. Both communities focused on workflow applications during the seminar and discussed various issues related to the efficient, resilient, and energy-efficient execution of workflows on distributed platforms. This report provides a brief executive summary of the seminar and lists all the presented material.
