A formal framework for fault tolerance in hybrid scientific workflows
- Research Article
72
- 10.1016/0045-7906(94)90035-3
- Sep 1, 1994
- Computers & Electrical Engineering
Expert system framework for fault detection and fault tolerance in robotics
- Conference Article
3
- 10.1145/1982185.1982307
- Mar 21, 2011
Due to its reusability and productivity benefits, the component-based approach has become the primary technology in service robot software frameworks, such as MRDS (Microsoft Robotics Developer Studio), RTC (Robot Technology Component), ROS (Robot Operating System) and OPRoS (Open Platform for Robotic Services). However, all the existing frameworks are very limited in fault tolerance support, even though fault tolerance is crucial for the commercial success of service robots. In this paper, we present a rule-based fault-tolerant framework together with widely used, representative fault tolerance measures. Based on our observation that most faults in components and applications in service robot systems follow common patterns, we equip the framework with the required fault tolerance functions. System integrators construct fault-tolerant applications from non-fault-aware components by declaring fault handling rules in configuration descriptors and/or adding simple helper components, considering the constraints of the components and the operating environment. Much more consistent system reliability can be obtained with less effort from the system developer. Various fault scenarios with a test robot system on the proposed OPRoS fault-tolerant framework demonstrate the benefits and effectiveness of the proposed approach.
- Research Article
- 10.1631/fitee.1601450
- Oct 1, 2018
- Frontiers of Information Technology & Electronic Engineering
As the scale of supercomputers rapidly grows, the reliability problem dominates the system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively fix this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), combining the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve the application execution efficiency. The novel cost model, called the ‘work-most’ (WM) model, makes runtime decisions to adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Similar to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on the failure locality, significantly avoiding losses caused by the failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency for common failure prediction accuracy, and is effective for petascale systems and beyond.
- Research Article
83
- 10.1016/j.compind.2018.03.027
- Apr 1, 2018
- Computers in Industry
Fault tolerance in cloud computing environment: A systematic survey
- Research Article
36
- 10.1016/j.ins.2021.03.003
- Mar 9, 2021
- Information Sciences
Real-time and dynamic fault-tolerant scheduling for scientific workflows in clouds
- Conference Article
10
- 10.1109/snpd.2018.8441131
- Jun 1, 2018
In OpenFlow-based software-defined networking (SDN), a single controller manages all network resources. However, it is a single point of failure and has restricted processing capacity. Multiple controllers emerged as a solution to ensure network reliability, scalability and high availability for large-scale networks. Despite the benefits, multiple controllers also bring increased complexity, with several new challenges affecting network management and scheduling. Although a centralized controller is suitable for small and medium-sized networks, the challenge is how to ensure its reliability and resiliency. This means faults have to be detected and failures recovered from as quickly as possible. Therefore, this paper proposes a fault tolerance framework (FTF) consisting of three controllers and an FT manager (FTM). The FTM has several components that contribute to FT by monitoring and detecting faults using heartbeat messages and recovering from failure using checkpointing. The approach is passive replication, where only one controller manages the network and, in the event of failure, another controller is elected using a novel voting technique. Additionally, the issue of network state consistency is handled adequately. We theoretically assessed the FTF against several FT design requirements. The evaluation shows that our FTF performs acceptably in ensuring strict consistency and a fault-tolerant system.
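The heartbeat monitoring and passive-replication failover described in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the class `HeartbeatMonitor`, the function `choose_primary`, the controller names and the timeout value are all hypothetical, and the paper's "novel voting technique" is not specified in the abstract, so a plain deterministic rule (pick the live controller with the smallest identifier) is swapped in for it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: records the last heartbeat time of each controller."""
    timeout: float  # seconds of silence after which a controller is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, controller_id: str, now: float) -> None:
        # Record that a heartbeat from this controller arrived at time `now`.
        self.last_seen[controller_id] = now

    def alive(self, now: float) -> List[str]:
        # A controller counts as alive if it was heard from within `timeout`.
        return sorted(
            cid for cid, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )


def choose_primary(monitor: HeartbeatMonitor, current: str, now: float) -> Optional[str]:
    """Keep the current primary while it is alive; otherwise fail over.

    Assumption: the replacement is the live controller with the smallest id.
    This stands in for the paper's voting technique, which the abstract
    does not describe.
    """
    alive = monitor.alive(now)
    if current in alive:
        return current
    return alive[0] if alive else None


monitor = HeartbeatMonitor(timeout=3.0)
for cid in ("c1", "c2", "c3"):
    monitor.heartbeat(cid, now=0.0)

primary_before = choose_primary(monitor, "c1", now=1.0)

# c1 goes silent; the standbys keep sending heartbeats.
monitor.heartbeat("c2", now=4.0)
monitor.heartbeat("c3", now=4.0)
primary_after = choose_primary(monitor, "c1", now=5.0)
```

Passing the clock in as `now` keeps the sketch deterministic. A real deployment would read a monotonic clock and would also restore the new primary's state from the latest checkpoint, which is omitted here.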
- Research Article
- 10.55524/ijircst.2022.10.1.14
- Jan 27, 2022
- International Journal of Innovative Research in Computer Science & Technology
Fault tolerance is one of the most crucial concerns in distributed systems. A fault tolerance system is very difficult to implement due to its dynamic nature and complex services. Several research efforts are consistently being made to implement fault tolerance in distributed systems. Some recent surveys try to incorporate the several fault tolerance architectures and methodologies proposed for distributed systems. This paper gives a systematic and comprehensive interpretation of different fault types, their causes, and various fault-tolerance approaches used in a distributed system. The paper presents a broad survey of various fault tolerance frameworks in the context of their basic approaches, fault applicability, and other key features. We investigate the different fault tolerance techniques used in distributed and scalable systems. Scalability is an important factor in distributed systems. It describes the ability of the system to dynamically adjust its own computing performance by changing available computing resources and scheduling methods. The focus of this paper is on the types of faults occurring in the system and fault detection techniques. A fault can occur in the system due to a link failure or for any other reason. An appropriate fault detection technique can avoid loss and prevent system failure. The main objective of a fault-tolerant computer system is to continue operating uninterrupted despite the failure of one or more of its components. In the early days, computer systems were not distributed and did not share resources. Now, most computer systems are distributed: they work independently on a common task. So, if one system experiences a fault, the other systems take over the computation of the faulty system, and the user's tasks are not affected.
- Conference Article
3
- 10.1145/2494621.2494625
- Aug 9, 2013
Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.
- Conference Article
51
- 10.1109/ftcs.1999.781045
- Jun 15, 1999
We describe a fault tolerance framework for CORBA that provides fault tolerance management and core services, implemented above the ORB for ease of use and customization, and fault tolerance mechanisms, implemented beneath the ORB for transparency and efficiency. Strong replica consistency is facilitated by a multicast engine that provides reliable totally ordered delivery of multicast messages to the replicas of an object. Transparency to the application allows application programmers to focus on their applications rather than on fault tolerance, and transparency to the ORB allows existing commercial CORBA ORBs to be used without modification. The fault tolerance framework adheres to CORBA's objective of interoperability by ensuring that different implementations of the specifications of the framework can interoperate and that non-fault-tolerant objects can interwork with fault-tolerant objects.
- Research Article
5
- 10.5772/54023
- Nov 1, 2012
- International Journal of Advanced Robotic Systems
Recently, the component-based approach has become a major trend in intelligent service robot development due to its reusability and productivity. The framework in a component-based system should provide essential services for application components. However, to our knowledge, the existing robot frameworks do not yet support a fault tolerance service. Moreover, it is often believed that faults can be handled only at the application level. In this paper, by extending the robot framework with a fault tolerance function, we argue that the framework-based fault tolerance approach is feasible and even has many benefits, including: 1) system integrators can build fault-tolerant applications from non-fault-aware components; 2) the constraints of the components and the operating environment, which cannot easily be anticipated at the time of component development, can be considered at the time of integration; 3) consistency in system reliability can be obtained in spite of diverse application component sources. In the proposed construction, we build XML rule files defining the rules for probing and determining the fault conditions of each component, contamination cases from a faulty component, and the possible recovery and safety methods. The rule files are established by a system integrator, and the fault manager in the framework controls the fault tolerance process according to the rules. We demonstrate that the fault-tolerant framework can incorporate widely accepted fault tolerance techniques. The effectiveness and real-time performance of the framework-based approach and its techniques are examined by testing an autonomous mobile robot in typical fault scenarios.
- Conference Article
1
- 10.1109/icet.2005.1558933
- Dec 19, 2005
Real-time systems should characteristically be fault tolerant. In this paper, a fault tolerance mechanism for real-time systems is proposed. First, a model is discussed that is a modification of the distributed recovery block and is based on distributed computing. Then a model is proposed that combines distributed computing with a feed-forward artificial neural network methodology. The proposed technique is based on executing design-diverse variants on replicated hardware and assigning weights to the results produced by the variants. Thus the proposed method encompasses both forward and backward recovery mechanisms, but the main focus is on forward recovery.
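The idea of weighting the results produced by diverse variants can be illustrated with a small weighted-voting sketch. It rests on stated assumptions rather than on the paper's model: the function `weighted_vote` is hypothetical, and the weights are fixed numbers supplied by the caller, whereas the abstract indicates they would come from a feed-forward neural network.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def weighted_vote(results: Iterable[Tuple[Hashable, float]]) -> Hashable:
    """Return the result value with the largest total weight.

    `results` holds one (value, weight) pair per variant. The weights are
    assumed to be given; how they are learned is outside this sketch.
    """
    totals: Dict[Hashable, float] = defaultdict(float)
    for value, weight in results:
        totals[value] += weight
    if not totals:
        raise ValueError("no variant results to vote on")
    # Pick the value whose accumulated weight is highest.
    return max(totals.items(), key=lambda item: item[1])[0]


# Three variants: two agree on 42, one (possibly faulty) reports 41.
chosen = weighted_vote([(42, 0.5), (41, 0.3), (42, 0.4)])
```

Because a result is produced as soon as the variants report, without rolling back and re-executing, this kind of voting is a forward recovery step, which matches the focus stated in the abstract.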
- Conference Article
7
- 10.1109/iceeot.2016.7754760
- Mar 1, 2016
In most instances the best pattern recognizers are humans, yet we do not understand how humans recognize patterns. Pattern recognition is critical in human decision tasks: the more relevant the patterns at your disposal, the better your decision will be. More recently, artificial neural network techniques in pattern recognition have been receiving increasing attention. The work addressed the question of whether neural networks are inherently fault tolerant. Neural networks were viewed at an abstract functional level rather than a physical implementation level, so that their computational fault tolerance could be assessed and understood. The design of a recognition system requires attention to the following aspects: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, learning, and selection of training and test samples.
- Conference Article
1
- 10.1109/icgccee.2014.6921378
- Mar 1, 2014
For decades, fault handling has been a nontrivial task for applications, as a fault is an abnormal situation that may occur when an application is running in a production environment. For distributed application components such as web services, fault handling likewise needs a more sophisticated approach to ensure the QoS of the service. Recently, web services have become a de facto standard for developing loosely coupled distributed application components due to their support for interoperability irrespective of platform and language. However, the web service architecture still fails to provide the robustness and fault tolerance that business organizations need to serve consumers accurately and effectively around the clock (24×7). This work proposes a robust and fault-tolerant framework to handle and repair faults in web services in order to provide smooth and efficient services. A set of diversity and repair actions is also proposed that provides normal resolution of services from manual breakdown of services. In the framework, the full stack of web service standards is used along with fault tolerance capability in the existing web service architecture. The proposed framework mainly uses two base components: an Agent for maintaining replicas, and a Controller for fault detection, fault notification and fault confinement, in order to provide robustness and fault tolerance in the existing web service architecture.
- Research Article
14
- 10.1016/j.vlsi.2019.09.008
- Sep 23, 2019
- Integration
Fault tolerance in memristive crossbar-based neuromorphic computing systems
- Research Article
54
- 10.1016/j.cosrev.2021.100398
- Apr 10, 2021
- Computer Science Review
Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment