Checkpoint Interval Research Articles

Abstract: Fault-tolerance is an essential part of a stream processing system, guaranteeing that data analysis can continue even after failures. State-of-the-art distributed stream processing systems use checkpointing to support fault-tolerance for stateful computations, periodically persisting the state of those computations. However, the checkpointing frequency affects system performance (utilization, latency, and throughput), because the checkpointing process consumes resources and time that could otherwise be used for actual computation. In practice, systems are often configured to checkpoint at crude values that ignore factors such as checkpoint and restart costs, which leads to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes system utilization for stream processing systems, minimizing the impact of checkpointing on performance. In this article, we investigate the practical benefits of this theoretical optimum by conducting experiments with different streaming applications in a real-world cloud setting, using Apache Flink, a well-known stream processing system. The results show that the optimal interval achieves better utilization, confirming that the theoretical model is practical for real-world applications. We observed utilization improvements from 10% to 200% for failure rates ranging from 0.3 failures per hour to 0.075 failures per minute. Moreover, we explore how two other performance measures, latency and throughput, are affected by the optimal interval. Our observations show that the optimal interval can also yield significant improvements in both latency and throughput.
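The abstract does not state the authors' utilization model, so the sketch below substitutes a well-known stand-in: Young's first-order approximation. It assumes a checkpoint cost C, a restart cost R, and a mean time between failures M (all in seconds), and estimates utilization as U(T) = 1 - C/T - (T/2 + R)/M. Setting dU/dT = 0 gives T = sqrt(2CM). The function names and the cost values are hypothetical; only the two failure rates come from the abstract.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal checkpoint interval: T = sqrt(2 * C * M).

    checkpoint_cost (C) and mtbf (M, mean time between failures) share one time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def utilization(interval: float, checkpoint_cost: float,
                restart_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of time spent on useful work.

    Checkpointing costs C once per interval T (overhead C/T). Each failure,
    occurring on average once per M, costs the restart time R plus about T/2
    of lost work that must be recomputed (overhead (T/2 + R)/M).
    """
    overhead = checkpoint_cost / interval + (interval / 2.0 + restart_cost) / mtbf
    return max(0.0, 1.0 - overhead)


if __name__ == "__main__":
    C, R = 10.0, 30.0  # hypothetical checkpoint and restart costs, in seconds
    # The two endpoints of the failure-rate range quoted in the abstract.
    for label, mtbf in [("0.3 failures/hour", 3600 / 0.3),
                        ("0.075 failures/minute", 60 / 0.075)]:
        t_opt = young_interval(C, mtbf)
        print(f"{label}: MTBF {mtbf:.0f} s, T_opt {t_opt:.0f} s, "
              f"U(T_opt) {utilization(t_opt, C, R, mtbf):.3f}, "
              f"U(60 s) {utilization(60.0, C, R, mtbf):.3f}")
```

Comparing U(T_opt) with U at a fixed 60 s interval illustrates why a crude, cost-agnostic setting can waste capacity. In Flink, an interval computed this way would be converted to milliseconds and passed to `StreamExecutionEnvironment.enableCheckpointing`.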

The Internet of Things (IoT) helps improve quality of life because it supports many applications, especially healthcare systems. Data generated by IoT devices is sent to Cloud Computing (CC) infrastructure for processing and storage, despite the latency caused by the distance. As the number of IoT devices has grown rapidly, the volume of data sent to the CC has increased, adding network congestion in the cloud to the latency problem. Fog Computing (FC) has been used to address these problems because of its proximity to IoT devices, with only filtered data forwarded to the CC. FC is a middle layer located between the IoT devices and the CC layer. To handle the massive data that IoT devices send to the FC layer, the Dynamic Weighted Round Robin (DWRR) algorithm was used: a load balancing (LB) algorithm that schedules and distributes data among fog servers according to their CPU and memory values in order to improve system performance. The results showed that DWRR provides high throughput, reaching 3290 req/sec with 919 users. Much research addresses workload distribution with LB techniques while paying little attention to Fault Tolerance (FT), the ability of a system to continue operating even when a fault occurs. Therefore, we proposed an FT replication technique, primary-backup replication based on a dynamic checkpoint interval on the FC. Checkpointing replicates new data from a primary server to a backup server dynamically by monitoring the CPU value of the primary fog server, so that a checkpoint occurs only when the CPU value is greater than 0.2, which reduces overhead. The results showed that the execution time of the data filtering process on the FC is lower with the dynamic checkpoint than with a static checkpoint that is independent of the CPU status.
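The two mechanisms in this abstract can be sketched as follows, under stated assumptions. The abstract does not give the DWRR weight formula, so `dynamic_weight` uses a hypothetical rule (free CPU plus free memory), and `pick_server` applies smooth weighted round robin, a common WRR variant swapped in here, which may differ from the paper's exact scheduler. `PrimaryBackup.maybe_checkpoint` implements the checkpoint rule as the abstract states it, treating the 0.2 threshold as a CPU-utilization fraction. All class and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FogServer:
    name: str
    cpu: float     # CPU utilization in [0, 1], read from monitoring
    memory: float  # memory utilization in [0, 1], read from monitoring
    current_weight: float = 0.0  # running state for smooth weighted round robin


def dynamic_weight(server: FogServer) -> float:
    # Hypothetical weighting: more free CPU and memory means a larger share.
    # A small floor keeps a fully loaded server from getting a zero weight.
    return max(0.01, (1.0 - server.cpu) + (1.0 - server.memory))


def pick_server(servers: list[FogServer]) -> FogServer:
    """Smooth weighted round robin over weights recomputed from current load."""
    weights = [dynamic_weight(s) for s in servers]
    for s, w in zip(servers, weights):
        s.current_weight += w
    chosen = max(servers, key=lambda s: s.current_weight)
    chosen.current_weight -= sum(weights)
    return chosen


@dataclass
class PrimaryBackup:
    cpu_threshold: float = 0.2  # threshold quoted in the abstract
    pending: list = field(default_factory=list)  # records not yet on the backup
    backup: list = field(default_factory=list)   # state replicated to the backup

    def write(self, record) -> None:
        self.pending.append(record)

    def maybe_checkpoint(self, primary_cpu: float) -> bool:
        """Copy pending records to the backup only if primary CPU exceeds the threshold."""
        if primary_cpu > self.cpu_threshold and self.pending:
            self.backup.extend(self.pending)
            self.pending.clear()
            return True
        return False


if __name__ == "__main__":
    servers = [FogServer("fog-1", cpu=0.5, memory=0.5),  # weight 1.0
               FogServer("fog-2", cpu=0.0, memory=0.0)]  # weight 2.0
    print([pick_server(servers).name for _ in range(6)])

    replica = PrimaryBackup()
    replica.write({"sensor": "hr", "value": 72})  # hypothetical record
    print(replica.maybe_checkpoint(primary_cpu=0.1),
          replica.maybe_checkpoint(primary_cpu=0.6))
```

Because the weights are recomputed on every call, a server whose CPU or memory load rises automatically receives a smaller share of subsequent requests. The smooth variant also interleaves picks instead of sending bursts to the heaviest-weighted server.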

Articles published on Checkpoint Interval

Checkpointing models for tasks of different types

A Dynamic Checkpoint Interval Decision Algorithm for Live Migration-Based Drone-Recovery System

In-Memory Versioning (IMV)

Intelligent Fault-Tolerant Mechanism for Data Centers of Cloud Infrastructure

Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice

Managing Big Interval Data with CINTIA: The Checkpoint INTerval Array

Dynamic Fault Tolerance Aware Scheduling for Healthcare System on Fog Computing

Minimizing Energy and Computation in Long-Running Software

Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications

Efficient Analysis of Repairable Computing Systems Subject to Scheduled Checkpointing

On Providing OS Support to Allow Transparent Use of Traditional Programming Models for Persistent Memory

A utilization model for optimization of checkpoint intervals in distributed stream processing systems

Semi-Stream Similarity Join Processing in a Distributed Environment

An optimal checkpointing model with online OCI adjustment for stream processing applications

Instantaneous Mean-Time-To-Failure (MTTF) estimation for checkpoint interval computation at run time

Checkpointing Strategies for Shared High-Performance Computing Platforms

On the design of reactive approach with flexible checkpoint interval to tolerate faults in cloud computing systems

The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints

Cloud computing and big data: Technologies and applications

Efficient modeling and optimizing of checkpointing in concurrent component-based software systems