Abstract

In this article, we propose a novel scheme for diagnosing intermittent faults in cloud systems. We investigated the characteristics of high-level symptomatic behavior on top of a cloud system and found that (1) the arrival counts of high-level symptoms rise with the number of fault injections at different speeds, which may help differentiate one fault model from another; (2) the nesting level of fatal traps is indicative of fault duration, which is helpful for fault model diagnosis; and (3) the fatal traps triggered by particular faulty units provide useful information for locating faults. Based on these features, we define an n-dimensional space in which each dimension is a symptom's arrival rate (the growth slope of its arrival count), which formulates the diagnosis problem as a pattern recognition problem. We then propose an online hardware fault diagnosis scheme based on a backpropagation neural network. Experimental results show that the diagnosis accuracy is 99.2% for fault location and 96.7% for fault model, and that the latency is affordable. The scheme is implemented in firmware, so it covers the cloud software stack (virtual machine monitor, virtual machines, and user applications) and incurs zero hardware overhead.
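
The arrival-rate feature space described in the abstract can be illustrated with a minimal sketch. It assumes that a symptom's arrival rate is estimated as the least-squares slope of its arrival count against the number of fault injections; the article may compute the slope differently. The function names, the symptom labels ("fatal_trap", "hang"), and the sample counts are hypothetical and chosen only for illustration. The resulting vector is the kind of input a classifier, such as the backpropagation neural network mentioned above, would receive.

```python
from typing import Dict, List, Sequence


def arrival_rate(injections: Sequence[float], counts: Sequence[float]) -> float:
    """Least-squares slope of arrival count versus number of fault injections.

    Assumption: the slope is an ordinary least-squares fit; this is an
    illustrative choice, not necessarily the article's estimator.
    """
    n = len(injections)
    if n != len(counts) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(injections) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in injections)
    if sxx == 0:
        raise ValueError("injection counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(injections, counts))
    return sxy / sxx


def feature_vector(
    injections: Sequence[float],
    counts_by_symptom: Dict[str, Sequence[float]],
    symptom_order: Sequence[str],
) -> List[float]:
    """Build one point in the n-dimensional space: one arrival rate per symptom."""
    return [arrival_rate(injections, counts_by_symptom[s]) for s in symptom_order]


if __name__ == "__main__":
    # Hypothetical observations: cumulative symptom counts after 1..4 injections.
    injections = [1, 2, 3, 4]
    observed = {
        "fatal_trap": [2, 4, 6, 8],
        "hang": [1, 1, 2, 2],
    }
    print(feature_vector(injections, observed, ["fatal_trap", "hang"]))
```

Fixing the symptom order keeps every dimension of the vector tied to the same symptom across runs, which a pattern recognition step would require.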

Highlights

  • The new and emerging generation of cyber-physical systems,[1] such as those supported by the Internet of Things (IoT),[2] has posed a new set of requirements for computing systems.

  • The diagnosis strategy fundamentally depends on the answers to several key questions, which we investigate in this work. (1) Are detection mechanisms effective for combinational logic under intermittent faults? First, the mechanisms should cover all three fault models.

  • Experimental results show that the diagnosis accuracy is 99.2% for fault location and 96.7% for fault model, with latency favorable for hardware recovery mechanisms.


Summary

Introduction

The new and emerging generation of cyber-physical systems,[1] such as those supported by the Internet of Things (IoT),[2] has posed a new set of requirements for computing systems. Good examples of cyber-physical applications are those running in the context of vehicular ad hoc networks (VANETs).[3] In these systems, drivers are offered a set of services that may include congestion information, parking place management, entertainment, vehicle tracking, and so on. Applications of this type need to run efficiently (with fast processing and storage) and to be reliable. Reliability refers to the probability that a system, including all of its hardware and software components, performs correctly as expected.

