From off-line to continuous on-line maintenance

Mauro Pezzè

doi:10.1109/icsm.2012.6405244

Abstract

Software is the cornerstone of modern society. Many human activities rely on software systems that shall operate seamlessly 24/7, and failures in such systems may cause severe problems and considerable economic loss. To efficiently address a growing variety of increasingly complex activities, software systems rely on sophisticated technologies. Most software systems are assembled from modules and subsystems that are often developed by third-party organizations and are sometimes not even available at system build time. This is the case, for example, for many Web applications that link Web services built and changed independently by third-party organizations while the Web applications are running. Progress in software engineering over the last decades has increased productivity, reduced costs, and improved the reliability of software products, but has not eliminated field failures. Detecting and removing all faults before deployment is too expensive in practice even for systems that are simple and fully available at design time, and impossible when systems are large and complex and are dynamically linked to modules that may be developed and distributed only after the system has been deployed. The classic stop-and-go maintenance approaches that locate and fix field faults offline before deploying new system versions are important, but not sufficient to guarantee seamless 24/7 behavior, because the faulty systems remain in operation until the faults have been removed and the new systems redeployed [1]. On the other hand, classic fault-tolerant approaches that constrain developers' freedom and rely on expensive mechanisms to avoid or mask faults do not match the cost requirements of many modern systems, and do not extend beyond the set of safety-critical systems [2]. Self-healing systems and autonomic computing tackle these new challenges by moving activities from design time to runtime.
In self-healing systems, the borderline between design and runtime activities fades, and both design and maintenance activities must change so that activities such as fault diagnosis and fixing can be performed fully automatically and at runtime. Maintenance activities rely on information that is usually available at design time and is not part of the system runtime infrastructure. For example, corrective maintenance requires some knowledge about the expected system behavior to locate and fix faults, while adaptive and perfective maintenance require some knowledge about libraries and components to identify new modules that better cope with changes in the requirements and in the environment. In classic maintenance approaches, this knowledge is mastered by the developers, who gather and use the required information offline to deal with emerging maintenance problems. In self-healing systems, the knowledge required for maintenance activities shall be available at runtime. Self-healing systems shall be designed with enough embedded knowledge to deal with unplanned events, and shall be able to exploit this information automatically and at runtime to recover from unexpected situations, such as field failures. The challenge of designing powerful self-healing systems lies in minimizing the amount of extra knowledge to be provided at design time while still feeding a powerful automatic recovery mechanism. An interesting approach relies on the observation that software systems are redundant by nature, and exploits the intrinsic redundancy of software systems to fix faults, thus minimizing the extra effort required at design time to feed the self-healing mechanism [3].
The intrinsic redundancy of software stems from design and reusability practices: the reuse of libraries may result in different ways to achieve the same or similar results, design for modularity may produce methods with equivalent behavior, and backward compatibility may keep deprecated and new implementations in the same system. For example, libraries like Ant and Log4J implement several functionalities already available in the standard Java libraries to improve efficiency or usability, while graphical libraries like SWT, Swing, and AWT provide overlapping functionality that may be available in systems that include two or more of these libraries. This redundancy is available for free at runtime, and can be exploited both to design self-healing mechanisms that are activated automatically to handle failures at runtime, and to improve maintenance mechanisms by facilitating failure reproduction, fault localization, and fault fixing.
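
As a rough illustration of how intrinsic redundancy might be exploited at runtime, the following minimal Python sketch tries a list of supposedly equivalent implementations in order and falls back to the next one when the current one fails. It is not the mechanism described in [3]: the helper call_with_redundancy, the exception AllAlternativesFailed, and the two sorting functions are hypothetical names introduced only for this example, and the sketch assumes that a failure shows up as a raised exception.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllAlternativesFailed(Exception):
    """Raised when every redundant implementation has failed."""


def call_with_redundancy(alternatives: Sequence[Callable[[], T]]) -> T:
    """Run equivalent implementations in order and return the first result.

    Assumption: a failure manifests as an exception. Silent wrong results
    would need an explicit oracle, which this sketch does not provide.
    """
    errors: List[Exception] = []
    for alternative in alternatives:
        try:
            return alternative()
        except Exception as exc:
            # Record the failure and move on to the next equivalent.
            errors.append(exc)
    raise AllAlternativesFailed(errors)


def faulty_sorted(items: List[int]) -> List[int]:
    # Hypothetical primary implementation with a field fault:
    # it does not handle empty input.
    if not items:
        raise IndexError("empty input not handled")
    return sorted(items)


def redundant_sorted(items: List[int]) -> List[int]:
    # Hypothetical equivalent implementation that happens to coexist
    # in the same system (copy, then sort in place).
    result = list(items)
    result.sort()
    return result


def healed_sort(items: List[int]) -> List[int]:
    # The primary is tried first; the redundant one acts as a workaround.
    return call_with_redundancy(
        [lambda: faulty_sorted(items), lambda: redundant_sorted(items)]
    )
```

Here healed_sort behaves like the primary implementation whenever that one succeeds, and only falls back to the redundant one on failure. The faulty code stays in place, so the workaround keeps the system running but does not replace an offline fix.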
