Abstract

Although resilience is already an established field in systems science, and many methodologies and approaches are available to deal with it, the unprecedented scale of computing, the massive data to be managed, new network technologies, and drastically new forms of massive-scale applications bring new challenges that need to be addressed. This paper reviews the challenges and approaches to resilience in ultrascale computing systems from multiple perspectives, addressing the resilience aspects of hardware/software co-design for ultrascale systems, resilience against security attacks, new approaches and methodologies for resilience in ultrascale systems, and applications and case studies.

Highlights

  • Ultrascale computing is a new computing paradigm that arises naturally from the need for computing systems able to handle massive data in very large scale distributed systems, enabling new forms of applications that can serve a very large number of users with a timeliness never experienced before. Ultrascale Computing Systems (UCSs) are envisioned as large-scale complex systems joining parallel and distributed computing systems that will be two to three orders of magnitude larger than today's systems (in terms of the number of Central Processing Unit (CPU) cores).

  • While it is obviously difficult to predict future developments in processing architectures with high accuracy, we have identified two major trends that are likely to affect big data processing: the development of many-core devices and hardware/software co-design.

  • Memory elements are probably the hardware components that require the highest degree of fault tolerance: their extremely regular structure means that transistor density in memories is substantially greater than in any other device (see the error-correction sketch after this list).

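The following minimal sketch (in Python; the function names are illustrative and not from the paper) shows single-error correction with a Hamming(7,4) code, the kind of information redundancy typically used to protect dense memories against single-bit upsets:

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7 are
    p1 p2 d1 p3 d2 d3 d4); the parity bits are the information redundancy."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Recompute the parities; a nonzero syndrome is the 1-based position
    of the flipped bit, so any single-bit upset is corrected."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]]  # extract d1..d4

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                   # simulate a single-bit upset in memory
assert hamming74_decode(word) == [1, 0, 1, 1]
```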

Summary

Introduction

Ultrascale computing is a new computing paradigm that arises naturally from the need for computing systems able to handle massive data in very large scale distributed systems, enabling new forms of applications that can serve a very large number of users with a timeliness never experienced before. Many new services and applications will be able to take advantage of ultrascale platforms, such as big data analytics, life-science genomics and HPC sequencing, high-energy physics (such as QCD), scalable robust multiscale and multi-physics methods, and diverse applications for analysing large and heterogeneous data sets in social, financial, and industrial contexts. These applications need Ultrascale Computing Systems (UCSs) because their scientific goals require simulating larger problems within a reasonable time period. Faults are the cause of errors (reflected in the system state), which, without proper handling, may lead to failures (wrong and unexpected outcomes). Following these definitions, fault tolerance is the ability of a system to behave in a well-defined manner once an error occurs, as the sketch below illustrates.
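To make the fault/error/failure chain concrete, here is a minimal sketch (in Python; `flaky_computation` and the injected fault model are hypothetical, not from the paper) of time redundancy, one of the fundamental techniques listed in the outline below: re-executing a computation masks a transient error before it can propagate into a failure.

```python
class TransientFault(Exception):
    """A fault whose activation corrupts one execution (an error)."""

_faults_left = 2  # inject two transient faults, then behave correctly

def flaky_computation(x):
    """Hypothetical workload hit by injected transient faults."""
    global _faults_left
    if _faults_left > 0:
        _faults_left -= 1
        raise TransientFault("bit flip during computation")
    return x * x

def with_time_redundancy(func, arg, retries=3):
    """Re-execute on error so a detected error does not become a
    failure (a wrong or unexpected outcome), within a retry budget."""
    for _ in range(retries):
        try:
            return func(arg)
        except TransientFault:
            continue  # error detected; mask it by retrying
    raise RuntimeError("failure: error not contained by retries")

print(with_time_redundancy(flaky_computation, 12))  # prints 144
```

The well-defined behaviour required by the definition is preserved either way: the caller receives the correct result or an explicit, detectable failure, never a silently wrong outcome.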

Fault models for distributed computing
Dependable computing
Fault tolerance
Robustness
Lessons Learned from Big Data Hardware
Fault tolerance in digital hardware
Fundamental techniques
Data or information redundancy
Hardware redundancy
Time redundancy
Fault tolerant design
Memories
Programmable logic
Single processing cores
On-chip networks
Many-Core arrays
Toward Inherent Software Resilience
Repeatability and Reproducibility Challenges in UCSs
Findings
Conclusion

