Component survivability at runtime for mission-critical distributed systems

Joon S Park,Pratheep Chandramohan,Joseph V Giordano,Kevin A Kwiat,Avinash T Suresh

doi:10.1007/s11227-012-0818-2

Abstract

As information systems develop into larger and more complex implementations, the need for survivability in mission-critical systems is pressing. Furthermore, the requirement for protecting information systems becomes increasingly vital, while new threats are identified each day. It becomes more challenging to build systems that will detect such threats and recover from the damage. This is particularly critical for distributed mission-critical systems, which cannot afford a letdown in functionality even though there are internal component failures or compromises with malicious codes, especially in a downloaded component from an external source. Therefore, when using such a component, we should check to see if the source of the component is trusted and that the code has not been modified in an unauthorized manner since it was created. Furthermore, once we find failures or malicious codes in the component, we should fix those problems and continue the original functionality of the component at runtime so that we can support survivability in the mission-critical system. In this paper, we define our definition of survivability, discuss the survivability challenges in component-sharing in a large distributed system, identify the static and dynamic survivability models, and discuss their trade-offs. Consequently, we propose novel approaches for component survivability. Finally, we prove the feasibility of our ideas by implementing component recovery against internal failures and malicious codes based on the dynamic model.

Full Text