A formal framework for fault tolerance in hybrid scientific workflows


Similar Papers
  • Research Article
  • Citations: 72
  • 10.1016/0045-7906(94)90035-3
Expert system framework for fault detection and fault tolerance in robotics
  • Sep 1, 1994
  • Computers & Electrical Engineering
  • M.L Visinsky + 2 more

  • Conference Article
  • Citations: 3
  • 10.1145/1982185.1982307
Fault tolerant framework and techniques for component-based autonomous robot systems
  • Mar 21, 2011
  • Heejune Ahn + 3 more

Due to the benefits of its reusability and productivity, the component-based approach has become the primary technology in service robot software frameworks, such as MRDS (Microsoft Robotics Developer Studio), RTC (Robot Technology Component), ROS (Robot Operating System) and OPRoS (Open Platform for Robotic Services). However, all the existing frameworks are very limited in fault tolerance support, even though the fault tolerance function is crucial for the commercial success of service robots. In this paper, we present a rule-based fault tolerant framework together with widely used, representative fault tolerance measures. Based on our observation that most faults in components and applications in service robot systems have common patterns, we equip the framework with the required fault tolerant functions. The system integrators construct fault tolerance applications from non-fault-aware components by declaring fault handling rules in configuration descriptors and/or adding simple helper components, considering the constraints of the components and the operating environment. Much more consistent system reliability can be obtained with less effort from the system developer. Various fault scenarios with a test robot system on the proposed OPRoS fault tolerant framework demonstrate the benefits and effectiveness of the proposed approach.

  • Research Article
  • 10.1631/fitee.1601450
FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing
  • Oct 1, 2018
  • Frontiers of Information Technology & Electronic Engineering
  • Wei Hu + 2 more

As the scale of supercomputers rapidly grows, the reliability problem dominates the system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively fix this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), combining the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve the application execution efficiency. The novel cost model, called the ‘work-most’ (WM) model, makes runtime decisions to adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Similar to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on the failure locality, significantly avoiding losses caused by the failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency for common failure prediction accuracy, and is effective for petascale systems and beyond.

  • Research Article
  • Citations: 83
  • 10.1016/j.compind.2018.03.027
Fault tolerance in cloud computing environment: A systematic survey
  • Apr 1, 2018
  • Computers in Industry
  • Moin Hasan + 1 more

  • Research Article
  • Citations: 36
  • 10.1016/j.ins.2021.03.003
Real-time and dynamic fault-tolerant scheduling for scientific workflows in clouds
  • Mar 9, 2021
  • Information Sciences
  • Zhongjin Li + 5 more

  • Conference Article
  • Citations: 10
  • 10.1109/snpd.2018.8441131
SDN-SDWSN Controller Fault Tolerance Framework for Small to Medium Sized Networks
  • Jun 1, 2018
  • Bassey Isong + 2 more

In OpenFlow-based software defined networking (SDN), a single controller controls the entire network resources. However, it poses a single point of failure and has restricted processing capacity. Multiple controllers emerged as a solution to ensure network reliability, scalability and high availability for large scale networks. Despite the benefits, multiple controllers also bring about increased complexity with several new challenges affecting network management and scheduling. Although the centralized controller is suitable for small and medium sized networks, the challenge is how to ensure its reliability and resiliency. This means faults have to be detected and failures recovered from as quickly as possible. Therefore, this paper proposes a fault tolerance framework (FTF) consisting of three controllers and a FT manager (FTM). The FTM has several components that contribute to FT by monitoring and detecting faults using heartbeat messages and recovering from failure using checkpointing. The approach is passive replication, where only one controller manages the network and, in the event of failure, another controller is elected using a novel voting technique. Additionally, the issue of network state consistency is handled adequately. We theoretically assessed the FTF using several FT design requirements. The evaluation shows that our FTF performs acceptably in ensuring strict consistency and a fault tolerant system.
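
The combination of heartbeat monitoring, checkpointing and passive replication described in this abstract can be illustrated with a minimal Python sketch. Everything named here (`Controller`, `FaultToleranceManager`, the `timeout` parameter, the dictionary used as checkpointed state) is hypothetical and not taken from the paper. The paper's voting technique is not described in the abstract, so the sketch swaps in a much simpler rule: promote the first standby whose heartbeat is still fresh.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    name: str
    last_heartbeat: float = 0.0  # time of the most recent heartbeat received


@dataclass
class FaultToleranceManager:
    controllers: list            # index `active` is the primary, the rest are standbys
    timeout: float = 3.0         # assumed heartbeat timeout, in seconds
    active: int = 0
    checkpoint: dict = field(default_factory=dict)

    def heartbeat(self, name, now):
        """Record a heartbeat from the named controller."""
        for c in self.controllers:
            if c.name == name:
                c.last_heartbeat = now

    def save_checkpoint(self, state):
        """Keep a copy of the primary's state for use after failover."""
        self.checkpoint = dict(state)

    def is_alive(self, c, now):
        return now - c.last_heartbeat <= self.timeout

    def check(self, now):
        """Return the controller in charge, failing over if the primary is silent."""
        if not self.is_alive(self.controllers[self.active], now):
            # Simplified stand-in for the paper's election: first live controller wins.
            for i, c in enumerate(self.controllers):
                if self.is_alive(c, now):
                    self.active = i
                    break
            else:
                raise RuntimeError("no live controller available")
        return self.controllers[self.active].name


ftm = FaultToleranceManager([Controller("c1"), Controller("c2"), Controller("c3")])
for n in ("c1", "c2", "c3"):
    ftm.heartbeat(n, now=0.0)
ftm.save_checkpoint({"flows": 12})
primary_before = ftm.check(now=1.0)

# c1 goes silent; c2 and c3 keep sending heartbeats.
ftm.heartbeat("c2", now=4.0)
ftm.heartbeat("c3", now=4.0)
primary_after = ftm.check(now=5.0)
```

Passing `now` explicitly instead of reading a clock keeps the sketch deterministic; the promoted controller would resume from `ftm.checkpoint`.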

  • Research Article
  • 10.55524/ijircst.2022.10.1.14
A System Model of Fault Tolerance Technique in the Distributed and Scalable System: A Review
  • Jan 27, 2022
  • International Journal of Innovative Research in Computer Science & Technology
  • Deepika Dhawan + 2 more

Fault tolerance is one of the most crucial concerns in distributed systems. A fault tolerance system is very difficult to implement due to its dynamic nature and complex services. Several research efforts are consistently being made to implement fault tolerance in a distributed system. Some recent surveys try to incorporate the several fault tolerance architectures and methodologies proposed for a distributed system. This paper gives a systematic and comprehensive interpretation of different fault types, their causes, and various fault-tolerance approaches used in a distributed system. The paper presents a broad survey of various fault tolerance frameworks in the context of their basic approaches, fault applicability, and other key features. We investigate the different fault tolerance techniques used in distributed and scalable systems. Scalability is an important factor in distributed systems. It describes the ability of the system to dynamically adjust its own computing performance by changing available computing resources and scheduling methods. The focus of this paper is on the types of faults occurring in the system and on fault detection techniques. A fault can occur in the system due to a link failure or for any other reason. An appropriate fault detection technique can avoid a loss and prevent system failure. The main objective of a fault-tolerant computer system is to continue operating uninterrupted despite the failure of one or more of its components. In the early days, computer systems were not distributed and did not share resources. Now, most computers are distributed. They work independently on a common task, so if one system develops a fault, the other systems take over the computation of the faulty system. The user will not experience any issues with his tasks.

  • Conference Article
  • Citations: 3
  • 10.1145/2494621.2494625
Autonomous, failure-resilient orchestration of distributed discrete event simulations
  • Aug 9, 2013
  • Matthew Malensek + 3 more

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.

  • Conference Article
  • Citations: 51
  • 10.1109/ftcs.1999.781045
A fault tolerance framework for CORBA
  • Jun 15, 1999
  • L.E Moser + 2 more

We describe a fault tolerance framework for CORBA that provides fault tolerance management and core services, implemented above the ORB for ease of use and customization, and fault tolerance mechanisms, implemented beneath the ORB for transparency and efficiency. Strong replica consistency is facilitated by a multicast engine that provides reliable totally ordered delivery of multicast messages to the replicas of an object. Transparency to the application allows application programmers to focus on their applications rather than on fault tolerance, and transparency to the ORB allows existing commercial CORBA ORBs to be used without modification. The fault tolerance framework adheres to CORBA's objective of interoperability by ensuring that different implementations of the specifications of the framework can interoperate and that non-fault-tolerant objects can interwork with fault-tolerant objects.

  • Research Article
  • Citations: 5
  • 10.5772/54023
A Framework-Based Approach for Fault-Tolerant Service Robots
  • Nov 1, 2012
  • International Journal of Advanced Robotic Systems
  • Heejune Ahn + 2 more

Recently the component-based approach has become a major trend in intelligent service robot development due to its reusability and productivity. The framework in a component-based system should provide essential services for application components. However, to our knowledge the existing robot frameworks do not yet support a fault tolerance service. Moreover, it is often believed that faults can be handled only at the application level. In this paper, by extending the robot framework with the fault tolerance function, we argue that the framework-based fault tolerance approach is feasible and even has many benefits, including that: 1) the system integrators can build fault tolerance applications from non-fault-aware components; 2) the constraints of the components and the operating environment, which cannot be anticipated easily at the time of component development, can be considered at the time of integration; 3) consistency in system reliability can be obtained in spite of diverse application component sources. In the proposed construction, we build XML rule files defining the rules for probing and determining the fault conditions of each component, contamination cases from a faulty component, and the possible recovery and safety methods. The rule files are established by a system integrator and the fault manager in the framework controls the fault tolerance process according to the rules. We demonstrate that the fault-tolerant framework can incorporate widely accepted fault tolerance techniques. The effectiveness and real-time performance of the framework-based approach and its techniques are examined by testing an autonomous mobile robot in typical fault scenarios.

  • Conference Article
  • Citations: 1
  • 10.1109/icet.2005.1558933
A framework for fault tolerance in distributed real time systems
  • Dec 19, 2005
  • S Malik + 1 more

Real time systems must, by their nature, be fault tolerant. In this paper, a fault tolerance mechanism for real time systems is proposed. First, a model is discussed which is a modification of the distributed recovery block and is based on distributed computing. Then a model is proposed which is based on distributed computing along with a feed forward artificial neural network methodology. The proposed technique is based on the execution of design diverse variants on replicated hardware, and on assigning weights to the results produced by the variants. Thus the proposed method encompasses both forward and backward recovery mechanisms, but the main focus is on forward recovery.
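
The idea of weighting the results of design-diverse variants can be sketched as a weighted vote. This is only an illustration under stated assumptions: the function name `weighted_vote` and the fixed weights are hypothetical, whereas the abstract says the paper derives its weights with a feed forward neural network, which is not modeled here.

```python
from collections import defaultdict


def weighted_vote(results, weights):
    """Pick the result value backed by the largest total weight.

    results: maps a variant name to the value that variant produced.
    weights: maps a variant name to its (assumed, fixed) trust weight.
    """
    if not results:
        raise ValueError("no variant results to vote on")
    totals = defaultdict(float)
    for variant, value in results.items():
        # Variants producing the same value pool their weights.
        totals[value] += weights.get(variant, 0.0)
    return max(totals, key=totals.get)


# Two agreeing variants outweigh one more-trusted variant that disagrees.
chosen = weighted_vote(
    {"v1": 10, "v2": 10, "v3": 12},
    {"v1": 0.3, "v2": 0.3, "v3": 0.5},
)
```

Because a value is selected from the variants' outputs without re-executing anything, this corresponds to the forward recovery the abstract emphasizes.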

  • Conference Article
  • Citations: 7
  • 10.1109/iceeot.2016.7754760
Artificial neural network: Framework for fault tolerance and future
  • Mar 1, 2016
  • Farhana Kausar + 1 more

The best pattern recognizers in most instances are humans, yet we do not understand how humans recognize patterns. Pattern recognition is critical in human decision tasks: the more relevant the patterns at your disposal, the better your decision will be. More recently, artificial neural network techniques in pattern recognition have been receiving increasing attention. This work addressed the question of whether neural networks are inherently fault tolerant. Neural networks were visualized from an abstract functional level rather than a physical implementation level to allow their computational fault tolerance to be assessed and understood. The design of a recognition system requires concentrating on the following aspects: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, learning, and selection of training and test samples.

  • Conference Article
  • Citations: 1
  • 10.1109/icgccee.2014.6921378
A novel robust & fault tolerance framework for Webservices using WS-I* specification
  • Mar 1, 2014
  • Akhilesh Kumar Pandey + 2 more

Fault handling has never been a trivial task for applications, since a fault is an abnormal situation that may occur while an application is running in a production environment. For a distributed application component such as a webservice, fault handling likewise needs a more sophisticated approach to ensure QoS. Recently, webservices have become a de facto standard for developing loosely coupled distributed application components because they support interoperability irrespective of platform and language; however, the webservice architecture still fails to provide the robustness and fault tolerance that business organizations need to serve consumers accurately and effectively around the clock (24×7). This work proposes a robust and fault tolerant framework to handle and repair faults in webservices in order to provide smooth and efficient services. A set of diversity and repair actions is also proposed that provides normal resolution of services from manual breakdown of services. In the framework, the full stack of webservice standards is used along with fault tolerance capability in the existing webservice architecture. The proposed framework mainly uses two base components: an Agent for maintaining replicas, and a Controller for fault detection, fault notification and fault confinement, in order to provide robustness and fault tolerance in the existing webservice architecture.

  • Research Article
  • Citations: 14
  • 10.1016/j.vlsi.2019.09.008
Fault tolerance in memristive crossbar-based neuromorphic computing systems
  • Sep 23, 2019
  • Integration
  • Qi Xu + 6 more

  • Research Article
  • Citations: 54
  • 10.1016/j.cosrev.2021.100398
Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment
  • Apr 10, 2021
  • Computer Science Review
  • Muhammad Asim Shahid + 4 more
