Periodic Checkpointing Research Articles

Abstract: The industrial internet of things (IIoT) encompasses smart devices, manufacturing systems, humans, and networks for automated productive outcomes. The placement of devices and networks is vulnerable to distributed denial of service (DDoS) attacks that degrade the productivity and efficiency of IIoT smart devices. In this article, we propose a checkpoint‐intrigued adversary mitigation scheme (CIAMS) for improving the security features and recommendations of the detection systems. Features that use the recommendation to provide relevant information maintain their security level. A DDoS attack is dealt with at the outset, resulting in increased productivity. CIAMS is designed to address the vulnerability of the checkpoints and the ability of their features to survive an attack. The proposed scheme substantiates the security breach and lag in the checkpoint systems against DDoS attacks. The checkpoints' vulnerability level and surviving features are assessed using a classified learning approach. In this assessment, the degrading features are compensated for by improving the security functions, control, and access methods. Periodic checkpoint replacement and mutual security measures are used to mitigate the prolonged impact of DDoS attacks on the network. The proposed scheme's performance is verified using false positives, service distribution, lag, and efficiency score. The results show improvements in the industrial environment's service delivery and efficiency: the proposed scheme reduces false positives by 10.35% and improves the service distribution ratio and efficiency score by 11.68% and 12.55%, respectively, for different devices.


In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc and "replay" the desired log statements from a checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g. 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
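The idea of adaptive periodic checkpointing with a user-specified overhead budget, followed by replay from the nearest checkpoint, can be illustrated with a small Python sketch. This is not flor's API: the class `AdaptiveCheckpointer`, the functions `record` and `replay`, the toy `train_epoch` update, and the simulated timings (`epoch_seconds`, `save_seconds`) are all hypothetical names and values chosen for illustration. The sketch assumes a checkpoint is taken at the end of an epoch only if the cumulative checkpointing time would stay within the budget fraction of cumulative training time.

```python
import copy


class AdaptiveCheckpointer:
    """Hypothetical helper: checkpoint only while overhead stays under budget."""

    def __init__(self, overhead_budget=0.07):
        self.overhead_budget = overhead_budget  # e.g. 7% of training time
        self.compute_time = 0.0                 # cumulative training seconds
        self.checkpoint_time = 0.0              # cumulative checkpoint seconds
        self.checkpoints = {}                   # epoch -> state snapshot

    def maybe_checkpoint(self, epoch, state, epoch_seconds, save_seconds):
        """Snapshot `state` if doing so keeps overhead within the budget."""
        self.compute_time += epoch_seconds
        projected = (self.checkpoint_time + save_seconds) / self.compute_time
        if projected <= self.overhead_budget:
            self.checkpoints[epoch] = copy.deepcopy(state)
            self.checkpoint_time += save_seconds
            return True
        return False

    def latest_before(self, epoch):
        """Return (epoch, state copy) of the newest snapshot taken before `epoch`."""
        start = max(e for e in self.checkpoints if e < epoch)
        return start, copy.deepcopy(self.checkpoints[start])


def train_epoch(state, epoch):
    # Toy deterministic stand-in for one epoch of model training.
    state["w"] = state["w"] * 0.9 + epoch


def record(num_epochs, ckpt, epoch_seconds=10.0, save_seconds=2.0):
    """Training run with minimal logging; timings are simulated, not measured."""
    state = {"w": 0.0}
    ckpt.checkpoints[-1] = copy.deepcopy(state)  # initial state, stored at epoch -1
    for epoch in range(num_epochs):
        train_epoch(state, epoch)
        ckpt.maybe_checkpoint(epoch, state, epoch_seconds, save_seconds)
    return state


def replay(target_epoch, ckpt, probe):
    """Re-run only the epochs after the nearest snapshot, then apply a post-hoc probe."""
    start, state = ckpt.latest_before(target_epoch)
    for epoch in range(start + 1, target_epoch + 1):
        train_epoch(state, epoch)
    return probe(state)


ckpt = AdaptiveCheckpointer(overhead_budget=0.07)
record(10, ckpt)
# A log statement added in hindsight: inspect "w" as it was after epoch 7.
w_after_epoch_7 = replay(7, ckpt, lambda s: s["w"])
```

In this sketch the replay for epoch 7 restarts from the snapshot taken after epoch 5 and re-executes two epochs instead of eight. A real system would measure checkpoint and epoch times rather than take them as parameters, and would persist snapshots to storage rather than keep them in memory.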



Articles published on Periodic Checkpointing

Checkpointing models for tasks of different types

A Checkpointing Recovery Approach for Soft Errors Based on Detector Locations

CIAMS—Checkpoint‐intrigued adversary mitigation scheme for industrial internet of things

A Crash Recovery Scheme for a Hybrid Mapping FTL in NAND Flash Storage Devices

FATM: A failure‐aware adaptive fault tolerance model for distributed stream processing systems

Hindsight logging for model training

Joint optimal checkpointing and rejuvenation policy for real-time computing tasks

Failure Analysis Modelling in an Infrastructure as a Service (Iaas) Environment

FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing

Shrink

Fault Tolerance on Large Scale Systems using Adaptive Process Replication

Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

A Trust-based Uncoordinated Checkpointing Algorithm in Mobile Ad Hoc Networks (MANETs)

Data Compression for the Exascale Computing Era - Survey

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Using group replication for resilience on exascale systems

Special Issue: Euro‐Par 2012

Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI

PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS

An optimistic checkpoint mechanism based on job characteristics and resource availability for dynamic grids

