ENSURING FAULT TOLERANCE OF HIGH-PERFORMANCE COMPUTING USING LOCAL CHECKPOINTS

Андрій Володимирович Бондаренко, М.В. Якобовский

doi:10.14529/cmse140302

Abstract

One of the main problems in high-performance computing is continuing computations despite failures. In this paper, we consider the main definitions relating to dependability, briefly review failure rates for distributed systems, and survey rollback-recovery approaches. The classic fault tolerance technique used in parallel applications is the coordinated checkpointing protocol. This protocol takes a consistent global checkpoint by capturing the local state of each process node simultaneously and saving it to a parallel file system via I/O nodes. However, as the number of compute nodes increases and the size of applications grows, the performance overhead of this protocol can reach an unacceptable level. A solution to this problem is to use local storage for checkpointing. To provide protection, checkpoints must be duplicated to the local storage of other nodes. In this work, we develop a user-level approach and present a scheme for checkpointing to local storage. We prove that if the number of failures is less than the maximum allowable value for the scheme, then it is possible to recover from a consistent global checkpoint.

Highlights

Failures in distributed systems. Before discussing fault tolerance, we give the main definitions and consider the characteristics of failures in distributed systems.
Modern supercomputers consist of tens of thousands of nodes, each of which is equipped with processors and, as a rule, various accelerators.
Within this scheme, for each MPI process, the numbers of the compute nodes in whose memory copies of the local checkpoint must be stored are determined.
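The idea of assigning each process a set of nodes that hold copies of its local checkpoint can be illustrated with a minimal sketch. The ring placement below is an assumption made for illustration only, not the scheme from the paper, and the names `replica_nodes` and `can_recover` are hypothetical. Each node's checkpoint is kept on the node itself and on the next `copies - 1` nodes in a ring, so any set of fewer than `copies` failed nodes leaves at least one surviving copy of every checkpoint.

```python
from __future__ import annotations


def replica_nodes(node: int, num_nodes: int, copies: int) -> list[int]:
    """Return the nodes that hold a copy of `node`'s local checkpoint.

    Hypothetical ring placement: the node itself plus the next
    `copies - 1` nodes, wrapping around at `num_nodes`.
    """
    if not 1 <= copies <= num_nodes:
        raise ValueError("copies must be between 1 and num_nodes")
    return [(node + k) % num_nodes for k in range(copies)]


def can_recover(failed: set[int], num_nodes: int, copies: int) -> bool:
    """Check that every node's checkpoint survives on some non-failed node."""
    return all(
        any(holder not in failed
            for holder in replica_nodes(node, num_nodes, copies))
        for node in range(num_nodes)
    )
```

With 8 nodes and 3 copies, `replica_nodes(6, 8, 3)` gives `[6, 7, 0]`. Losing nodes 1 and 2 is recoverable under this placement, while losing nodes 1, 2 and 3 is not, because all three copies of node 1's checkpoint are gone.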

Summary

Failures in distributed systems

Before discussing fault tolerance, we give the main definitions and consider the characteristics of failures in distributed systems. A system consists of a set of components, each of which is in turn itself a system with its own internal structure. The systems that make up a set of interacting systems of level n are subsystems of a system of level n + 1. Each system of level n consists of a set of subsystems of level n − 1, which in turn consist of subsystems of level n − 2, and so on. The behavior of a system is what the system does to realize its function. Behavior that ensures the realization of the system's function is called correct. This altered state of the system's components is called an error. An event may occur in which the actual behavior deviates from the correct behavior, so that the system's behavior cannot realize its function, that is, the system does not deliver the expected function. A fault causes an error, which may or may not lead to a failure of the system.

Fault models

Failure statistics

Fault tolerance for distributed systems

Recovery methods

Common software solutions

The MPI standard and the ULFM extension

Memory of compute nodes

Scheme for saving local checkpoints

Resuming computations after a failure



Journal: Bulletin of the South Ural State University. Series "Computational Mathematics and Software Engineering"
Publication Date: Sep 1, 2014
License type: cc-by

