Quantitative evaluation of fault propagation in a commercial cloud system

Chao Wang, Zhongchuan Fu

doi:10.1177/1550147720903613

Abstract

As semiconductor technology scales into the nanometer regime, hardware faults have become a threat to computational devices. Cloud systems concentrate ever more computing density and energy; thus, fundamental research on topics such as dependability validation is needed to verify the robustness of clouds for sensor networks. However, dependability evaluation studies have often been carried out on isolated physical systems, such as processors, sensors, and single boards with or without operating system hosts. These studies have relied on inaccurate simulations instead of validating the complete cloud software stack (firmware, hypervisor, operating system hosts, and workloads) as a whole. In this article, we describe the implementation of a fault injection tool that validates the dependability of a commercial cloud software stack. Hardware faults induced by high energy density environments can be injected; the fault propagation through the cloud software stack is traced and quantitatively evaluated. Experimental results show that the integrated fault detection mechanisms of the cloud system, such as fatal trap detectors, leave a detection margin of 20% silent data corruption still to be narrowed. We additionally propose two detection mechanisms, which showed good performance in fault detection for cloud systems.

Highlights

The increasing use of wireless sensors that generate massive data, combined with the need to process these data efficiently, has given enormous importance to the cloud computing paradigm.[1]
Dependability evaluation of the software stack for a cloud computing environment
Hardware faults may have no effect on the state of the program at runtime and yield a correct result, or they may lead to a state corruption
If the state corruption is neither detected by a fault detection mechanism nor masked during propagation, it may result in a silent data corruption (SDC)
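
The three outcomes named in the highlights (no effect, detected corruption, and silent data corruption) can be illustrated with a minimal Python sketch. This is not the authors' fault injection tool: the single-bit-flip fault model, the toy `workload`, and the range-check `detector` are all assumptions chosen only to make each outcome reachable, and the resulting counts say nothing about the paper's measurements.

```python
from collections import Counter

WORD_BITS = 32  # assumed register width for this toy model


def flip_bit(value, bit, width=WORD_BITS):
    """Return value with one bit inverted (assumed single-bit-flip fault model)."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def workload(data):
    """Toy workload (hypothetical): sums only the low byte of each word."""
    return sum(x & 0xFF for x in data)


def detector(data, limit=0xFFFF):
    """Hypothetical range check standing in for a real detection mechanism."""
    return any(x > limit for x in data)


def classify(data, index, bit):
    """Inject one fault into a copy of data and classify the outcome."""
    golden = workload(data)              # fault-free reference result
    faulty = list(data)
    faulty[index] = flip_bit(faulty[index], bit)
    if detector(faulty):
        return "detected"                # corruption caught before use
    if workload(faulty) == golden:
        return "masked"                  # fault had no effect on the result
    return "sdc"                         # wrong result, nothing noticed


def campaign(data):
    """Exhaustively inject every single-bit fault and count the outcomes."""
    return Counter(
        classify(data, index, bit)
        for index in range(len(data))
        for bit in range(WORD_BITS)
    )


if __name__ == "__main__":
    words = [0x1234, 0x00FF, 0x0001, 0xABCD]  # all within the detector limit
    print(dict(campaign(words)))
```

In this toy setup, flips in the low byte change the result without tripping the range check, flips in the second byte are masked because the workload never reads them, and flips in the upper half are caught by the detector. A real campaign would compare against a golden run of the full software stack rather than a single function.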

Summary

Introduction

The increasing use of wireless sensors that generate massive data, combined with the need to process these data efficiently, has given enormous importance to the cloud computing paradigm.[1] The fault models used in the reliability validation experiment in the "Experimental results" section are presented through a selective survey of the presence and impact of semiconductor device lifetime in computing systems. The basic definitions of dependability of computing systems can be traced to 1982 in Lee and Morgan[13] and have been discussed by Avizienis et al.[14] and Salfner et al.[15] In this article, we use the concepts summarized by Kondo et al.,[16] who describe the basic threats to reliability, namely failures, errors, and faults. Failure: an observable event that occurs when the system deviates from its correct state and fails to operate as planned. When an error does not cause an external failure, it is a dormant error.

Journal: International Journal of Distributed Sensor Networks
Publication Date: Mar 1, 2020
License type: cc-by
