A Comprehensive Model for Software Rejuvenation

Abstract

Recently, the phenomenon of software aging, in which the state of a software system degrades with time, has been reported. This phenomenon, which may eventually lead to performance degradation and/or crash/hang failure, results from exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, and estimate transition probabilities and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
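
As a rough illustration of the final step, the sketch below searches for a rejuvenation interval that maximizes steady-state availability. It is not the paper's semi-Markov or availability model: it assumes a simple renewal cycle in which the time to an aging failure is Weibull distributed, the system is rejuvenated after `interval` time units if it has not failed, and both recovery paths have fixed downtimes. All names and numbers (`shape`, `scale`, `repair_time`, `rejuv_time`) are hypothetical.

```python
import math

def survival(t, shape, scale):
    """Probability that no aging failure has occurred by time t (Weibull, assumed)."""
    return math.exp(-((t / scale) ** shape))

def availability(interval, shape, scale, repair_time, rejuv_time, steps=2000):
    """Long-run availability when rejuvenating every `interval` time units.

    Expected uptime per cycle is the integral of the survival function over
    [0, interval], approximated here with the trapezoid rule.
    """
    h = interval / steps
    inner = sum(survival(i * h, shape, scale) for i in range(1, steps))
    ends = survival(0.0, shape, scale) + survival(interval, shape, scale)
    uptime = h * (inner + 0.5 * ends)
    # A cycle ends either in a failure (reactive recovery) or in a planned
    # rejuvenation (proactive recovery).
    p_fail = 1.0 - survival(interval, shape, scale)
    downtime = repair_time * p_fail + rejuv_time * (1.0 - p_fail)
    return uptime / (uptime + downtime)

def best_interval(candidates, **params):
    """Return the candidate interval with the highest availability."""
    return max(candidates, key=lambda t: availability(t, **params))

if __name__ == "__main__":
    # Hypothetical parameters: shape > 1 means the failure rate grows with age.
    params = dict(shape=2.0, scale=1000.0, repair_time=10.0, rejuv_time=1.0)
    t_star = best_interval(range(50, 3001, 50), **params)
    print(t_star, availability(t_star, **params))
```

With a shape parameter above 1 (an aging system) the search settles on an interior interval. With `shape=1.0` the failure rate is constant, and in this toy model rejuvenation only adds downtime, which is consistent with the premise that rejuvenation pays off only when the software actually ages.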

Similar Papers
  • Conference Article
  • Citations: 157
  • 10.1109/issre.1999.809313
A measurement-based model for estimation of resource exhaustion in operational software systems
  • Nov 1, 1999
  • K Vaidyanathan + 1 more

Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of aging, in which the state of the software system degrades with time, has been reported (S. Garg et al., 1998). The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the software or crash/hang failure, or both. Earlier work in this area to detect aging and to estimate its effect on system resources did not take into account the system workload. In this paper, we propose a measurement-based model to estimate the rate of exhaustion of operating system resources both as a function of time and the system workload state. A semi-Markov reward model is constructed based on workload and resource usage data collected from the UNIX operating system. We first identify different workload states using statistical cluster analysis and build a state-space model. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource exhaustion in the different states. The model is then solved to obtain trends and the estimated exhaustion rates and the time-to-exhaustion for the resources. With the help of this measure, proactive fault management techniques such as rejuvenation (Y. Huang et al., 1995) may be employed to prevent unexpected outages.
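
The calculation described above can be sketched as follows, under simplifying assumptions that are not taken from the paper: workload states and their transition probabilities are given rather than estimated by clustering, only mean sojourn times are used, and the time to exhaustion is approximated as available resource divided by the long-run average depletion rate. The function names and example numbers are hypothetical.

```python
import math

def embedded_stationary(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of the embedded chain with transition matrix P.

    Uses a "lazy" power iteration (average of the current vector and one
    step of the chain) so that periodic chains also converge.
    """
    n = len(P)
    nu = [1.0 / n] * n
    for _ in range(max_iter):
        stepped = [sum(nu[i] * P[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * a + 0.5 * b for a, b in zip(nu, stepped)]
        if max(abs(a - b) for a, b in zip(nu, nxt)) < tol:
            return nxt
        nu = nxt
    return nu

def time_fractions(P, mean_sojourn):
    """Long-run fraction of time the semi-Markov process spends in each state."""
    nu = embedded_stationary(P)
    weights = [v * h for v, h in zip(nu, mean_sojourn)]
    total = sum(weights)
    return [w / total for w in weights]

def time_to_exhaustion(P, mean_sojourn, depletion_rate, available):
    """Available resource divided by the time-averaged depletion (reward) rate."""
    pi = time_fractions(P, mean_sojourn)
    rate = sum(p * r for p, r in zip(pi, depletion_rate))
    return available / rate if rate > 0 else math.inf

if __name__ == "__main__":
    # Hypothetical three-state workload model (light, medium, heavy load).
    P = [[0.0, 0.7, 0.3],
         [0.5, 0.0, 0.5],
         [0.4, 0.6, 0.0]]
    mean_sojourn = [10.0, 5.0, 2.0]    # minutes spent per visit
    depletion_rate = [1.0, 4.0, 10.0]  # KB of memory lost per minute
    print(time_to_exhaustion(P, mean_sojourn, depletion_rate, available=50_000.0))
```

Weighting each state's depletion rate by the fraction of time spent there is what makes the estimate workload-dependent: the same system is predicted to exhaust a resource sooner if it spends more time in heavily loaded states.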

  • Conference Article
  • 10.1109/simsym.2000.844894
Proactive fault-management in software systems
  • Apr 16, 2000
  • K.S Trivedi

Hardware redundancy is a time-honored technique to enhance reliability. However, when applied to software systems, it is inherently expensive to implement due to the need to employ design diversity. Furthermore, recent studies have reported the transient nature of software failures for which design diversity is not very helpful. Transient failures typically occur because of design faults in software, which result in unacceptable erroneous states in the OS environment of the process. Hence, environment diversity, a generalization of system restart, has been proposed as a cheap yet effective technique for software fault-tolerance. The basic idea here is to modify the operating environment of the running process. Recently, the phenomenon of "software aging", in which the state of the software system degrades with time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the software or crash/hang failure or both. Software aging has been reported in widely used software like Netscape and xrn. Aging in AT&T telecommunication software is known to have resulted in packet loss. Numerous other examples exist, in systems with high availability requirements and also in safety-critical systems. To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The preventive action can be done at optimal times (for example, when the load on the system is low) so that the overhead due to planned system downtime is minimal. This method therefore avoids unplanned and potentially expensive system outages due to software aging.
A basic assumption here is that the overhead involved in the planned downtime and in performing the clean-up operation is considerably less than the cost incurred due to unplanned system outages. In this talk, we will discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and determining optimal times to perform rejuvenation, for different scenarios. This is done by developing stochastic models that trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. We use a Markov regenerative process model with a subordinated non-homogeneous Markov chain. The stochastic models have both theoretical and practical value. Depending on the failure characteristics of the software and the preventive maintenance policies, the appropriate model can be used to obtain optimal rejuvenation intervals based on several criteria. The second half of the talk will deal with measurement-based models, which are constructed using workload and resource usage data collected from the UNIX operating system over a period of time. Methodologies based on statistics and Markov models are used to detect software aging and to estimate its effect on various system resources. The measurement-based models are the first steps towards predicting aging-related failures, and are intended to help the development of strategies for software rejuvenation triggered by actual measurements.

  • Conference Article
  • Citations: 1
  • 10.1145/3297663.3310290
Software Aging and Software Rejuvenation
  • Apr 4, 2019
  • Kishor Trivedi

The study of failures has now become more important since it has been recognized that computer system outages are due more to software faults than to hardware faults. The phenomenon of software aging, in which the state of the system degrades with time, has been reported in widely used systems and also in high-availability and safety-critical systems. The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the system or crash/hang failure or both. To counteract this phenomenon, a proactive approach to fault management, called software rejuvenation, has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. This method therefore avoids or postpones unplanned and potentially expensive system outages due to aging. In this talk, we discuss methods of evaluating the effectiveness of proactive fault management in operational systems and determining optimal times to perform rejuvenation.

  • Research Article
  • Citations: 3
  • 10.11591/ijece.v10i6.pp5985-5991
An analysis of software aging in cloud environment
  • Dec 1, 2020
  • International Journal of Electrical and Computer Engineering (IJECE)
  • Shruthi P + 1 more

Cloud computing is an environment in which several virtual machines (VMs) run concurrently on physical machines. The cloud computing infrastructure hosts multiple cloud service segments that communicate with each other through interfaces. This creates a distributed computing environment. During operation, software systems accumulate errors or garbage that lead to system failure and other hazardous consequences. This condition is called software aging. Software aging happens because of memory fragmentation, large-scale resource consumption and accumulation of numerical error. Software aging degrades performance and may result in system failure through premature resource exhaustion. This issue cannot be detected during the software testing phase because of the dynamic nature of operation. The errors that cause software aging are of special types: they do not disturb the software functionality but affect the response time and the environment. The issue can therefore be resolved only at run time. To alleviate the impact of software aging, software rejuvenation is used. The rejuvenation process reboots the system or re-initiates the software, which avoids faults or failures. Software rejuvenation removes accumulated error conditions, frees up deadlocks and defragments operating system resources such as memory. Hence, it avoids future system failures that may happen due to software aging. As service availability is crucial, software rejuvenation is to be carried out at defined schedules without disrupting the service. The presence of software rejuvenation techniques can make software systems more trustworthy. Software designers are using this concept to improve the quality and reliability of software. Software aging and rejuvenation have generated a lot of research interest in recent years.
This work reviews some of the research related to detection of software aging and identifies research gaps.

  • Research Article
  • 10.5121/ijaia.2012.3302
Software Aging Analysis of Web Server Using Neural Networks
  • May 31, 2012
  • International Journal of Artificial Intelligence & Applications
  • G Sumathi

Software aging is a phenomenon that refers to progressive performance degradation, transient failures or even crashes in long-running software systems such as web servers. It mainly occurs due to the deterioration of operating system resources, fragmentation and numerical error accumulation. A primitive method to fight against software aging is software rejuvenation. Software rejuvenation is a proactive fault management technique aimed at cleaning up the system's internal state to prevent the occurrence of more severe crash failures in the future. It involves occasionally stopping the running software, cleaning its internal state and restarting it. An optimized schedule for performing software rejuvenation has to be derived in advance, because a long-running application cannot be taken down arbitrarily without incurring unnecessary cost. This paper proposes a method to derive an accurate and optimized schedule for rejuvenation of a web server (Apache) by using a Radial Basis Function (RBF) based Feed Forward Neural Network, a variant of Artificial Neural Networks (ANN). Aging indicators are obtained through an experimental setup involving an Apache web server and clients, and act as input to the neural network model. The authors report that this method is better than existing ones because the use of RBF leads to better accuracy and faster convergence.

  • Conference Article
  • 10.1109/icoin.2014.6799478
Software aging trend analysis of server virtualized system
  • Feb 1, 2014
  • Biju R Mohan + 1 more

It is well known that software systems suffer from performance degradation due to resource shrinking, and this phenomenon is referred to as software aging. Long-running software systems tend to show degradation in performance due to exhaustion of operating system resources, data corruption and numerical error accumulation. The primary objective of the paper is to establish the aging trend in a server virtualized system. It establishes the aging trend by showing that the average response time decreases while total available physical memory decreases. A linear regression model has been used to study the aging trend.
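
A linear-regression trend check of this kind can be sketched with ordinary least squares. The example below is only an illustration: the sample values are made up, and `fit_line` is a hypothetical helper, not code from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

if __name__ == "__main__":
    # Hypothetical hourly samples of available physical memory (MB).
    hours = [0, 1, 2, 3, 4, 5, 6, 7]
    free_mb = [4000, 3985, 3972, 3951, 3940, 3921, 3910, 3894]
    slope, intercept = fit_line(hours, free_mb)
    # A clearly negative slope suggests an aging trend in this resource.
    print(f"memory trend: {slope:.1f} MB/hour, starting near {intercept:.0f} MB")
```

A fitted slope alone does not show that a trend is statistically significant; in practice one would also check the fit quality or apply a trend test before acting on it.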

  • Research Article
  • Citations: 288
  • 10.1147/rd.452.0311
Proactive management of software aging
  • Mar 1, 2001
  • IBM Journal of Research and Development
  • V Castelli + 6 more

Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to software aging as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. Software rejuvenation is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost.
These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.

  • Book Chapter
  • 10.1007/978-1-4615-2241-6_37
Approximate Computation of Sojourn Time Distribution in Open Queueing Networks
  • Jan 1, 1995
  • Varsha Mainkar + 2 more

The method of decomposition of queues has been widely used in the solution of large and complex queueing networks for which exact solutions do not exist. We apply the basic paradigm of decomposition in computing approximations to the sojourn-time distribution in open queueing networks in which the service times and arrival processes are non-Markovian. To do so, we make use of existing results on the sojourn time distribution at a single queue. Using these, a queueing network is translated into a semi-Markov chain whose absorption time distribution approximates the sojourn time distribution of the queueing network. However, the semi-Markov model does not represent the state of the queueing network (i.e., the number of jobs at each queue). The state-space size of the semi-Markov model is thus linear in the number of queues in the network. This is achieved by having one state in the semi-Markov model corresponding to each queue in the queueing network, and one absorbing state to denote exit from the network. The states are then connected according to the topology of the network. The holding time distribution of a state is the sojourn time distribution at the corresponding queue. This sojourn time distribution must be computed by considering each queue in isolation. We approximate the arrival process to each queue by a phase-type arrival process, and then compute the sojourn time distribution assuming it is a PH/G/1 queue. Once we have the holding time distributions and the routing probability matrix, the absorption time distribution of the semi-Markov chain can be computed.
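
The construction above (one state per queue plus an absorbing exit state) can be illustrated for the first moment only. The sketch below computes the mean absorption time from a routing matrix and mean holding times; it does not compute the full distribution, and the function name and example values are hypothetical.

```python
def mean_absorption_times(routing, mean_holding, tol=1e-12, max_iter=100_000):
    """Mean time to reach the exit state from each queue.

    routing[i][j] is the probability of moving from queue i to queue j; any
    probability mass missing from row i goes to the absorbing exit state.
    Solves m_i = h_i + sum_j routing[i][j] * m_j by fixed-point iteration,
    which converges when every job eventually leaves the network.
    """
    n = len(routing)
    m = [0.0] * n
    for _ in range(max_iter):
        nxt = [mean_holding[i] + sum(routing[i][j] * m[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(m, nxt)) < tol:
            return nxt
        m = nxt
    return m

if __name__ == "__main__":
    # Hypothetical two-queue network: queue 0 always feeds queue 1, and
    # queue 1 sends 20% of jobs back to queue 0 before they exit.
    routing = [[0.0, 1.0],
               [0.2, 0.0]]
    mean_holding = [1.0, 2.0]  # mean sojourn time at each queue in isolation
    print(mean_absorption_times(routing, mean_holding))
```

The state space here has one entry per queue, which mirrors the point made in the abstract that the semi-Markov model grows linearly with the number of queues rather than with the number of jobs.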

  • Conference Article
  • Citations: 136
  • 10.1109/simsym.2000.844925
Modeling and analysis of software aging and rejuvenation
  • Apr 16, 2000
  • K.S Trivedi + 2 more

Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of software aging, in which the state of the software system degrades with time, has been reported. To counteract this phenomenon, a proactive approach to fault management, called software rejuvenation, has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. We discuss stochastic models to evaluate the effectiveness of proactive fault management in operational software systems and determine optimal times to perform rejuvenation, for different scenarios. The latter part of the paper deals with measurement-based methodologies to detect software aging and estimate its effect on various system resources. Models are constructed using workload and resource usage data collected from the UNIX operating system over a period of time. The measurement-based models are intended to help the development of strategies for software rejuvenation triggered by actual measurements.

  • Research Article
  • Citations: 26
  • 10.1016/j.jss.2006.06.029
Modeling and analysis of software aging and software failure
  • Aug 9, 2006
  • Journal of Systems and Software
  • Letian Jiang + 1 more

  • Research Article
  • Citations: 1
  • 10.30630/joiv.1.4-2.84
Initial Review on ICTS Governance for Software Anti-Aging
  • Nov 16, 2017
  • JOIV : International Journal on Informatics Visualization
  • Mohamad Khairudin Morshidi + 2 more

Over the past 20 years, a considerable amount of research on software aging has been conducted. Software aging is the situation in which errors accumulate in an operational software system that has run for a long time, which may lead to performance degradation and resource depletion, and eventually cause the software to crash or hang [1]. David Parnas divided software aging into two categories: 1) the failure of the software to adapt to a dynamic environment and 2) the result of the changes themselves [2]. Factors that can affect software aging can be classified into several categories: 1) functional, 2) human, 3) product and 4) environment [3]. In general, the factors that affect software aging can be divided into internal and external factors. The main objectives of this paper are to briefly describe the definitions of software aging and of ICTS governance. In addition, this paper compiles the software aging factors that have been investigated by previous researchers. The need for future research regarding ICTS governance and software aging is also determined at the end of this paper.

  • Conference Article
  • Citations: 21
  • 10.1109/issre.2013.6698905
Towards fast OS rejuvenation: An experimental evaluation of fast OS reboot techniques
  • Nov 1, 2013
  • Antonio Bovenzi + 4 more

Continuous or high availability is a key requirement for many modern IT systems. Computer operating systems play an important role in IT systems availability. Due to the complexity of their architecture, they are prone to suffer failures due to several types of software faults. Software aging causes a nonnegligible fraction of these failures. It leads to an accumulation of errors with time, increasing the system failure rate. This phenomenon can be accompanied by performance degradation and eventually system hang or even crash. As a countermeasure, software rejuvenation entails stopping the system, cleaning its internal state, and resuming its operation. This process usually incurs downtime. For an operating system, the downtime impacts any application running on top of it. Several solutions have been developed to speed up the boot time of operating systems in order to reduce the downtime overhead. We present a study of two fast OS reboot techniques for rejuvenation of Linux-based operating systems, namely Kexec and Phase-based reboot. The study measures the performance penalty they introduce and the gain in reduction of downtime overhead. The results reveal that the Kexec and Phase-based reboot have no statistically significant impact in terms of performance penalty from the user perspective. However, they may require extra resource (e.g., CPU) usage. The downtime overhead reduction, compared with normal Linux and VM reboots, is 77% and 79% in Kexec and Phase-based reboot, respectively.

  • Research Article
  • Citations: 2
  • 10.1109/tnsm.2020.3030589
ARES: A Framework for Management of Aging and Rejuvenation in Softwarized Networks
  • Jun 1, 2021
  • IEEE Transactions on Network and Service Management
  • Petra Vizarreta + 7 more

The recent trend of network softwarization suggests a radical shift in the implementation of traditional network intelligence. In Software Defined Networking (SDN), for instance, the control plane functions of forwarding devices are outsourced to the controller. Softwarized network components are expected to provide uninterrupted service during long periods of time, which makes them prone to the effects of software aging, a phenomenon that has been observed in operational software systems where the failure rate increases or the performance of the software degrades with the elapsed time since the last restart. The effects of software aging in operational networks are typically mitigated by software rejuvenation, i.e., planned restarts cleaning the internal system state in order to prevent or postpone aging-related failures. This article presents ARES, a three-step methodological framework for the management of the effects of software aging in softwarized networks, applied to the case study of open source SDN orchestration platforms. Using ARES, we demonstrate that software aging is a systematic problem that cannot be neglected in network orchestration systems. It stems not only from aging-related bugs and natural aging due to fragmentation, but also from design choices, e.g., when implementing distributed systems.
Measurements for Open Network Operating System (ONOS) and OpenDaylight (ODL) demonstrate how "simple" and common networking tasks let network performance degrade rapidly and even lead to crashes: for instance, adding and removing 300 intents per second in ONOS significantly increases the response time by 50% per day and depletes the memory at the rate of 18 GB per day. Moreover, we demonstrate a first rejuvenation approach that can mitigate the effects of aging in ONOS.

  • Research Article
  • Citations: 4
  • 10.36930/40300219
Software aging of mobile applications: an analysis of the problem
  • Jun 4, 2020
  • Scientific Bulletin of UNFU
  • В С Яковина + 1 more

A review and analysis of the literature on software aging in mobile applications was carried out. The main characteristics of the software aging phenomenon were identified. It was established that mobile systems and applications are especially vulnerable to aging effects and require detailed study. The main methods and tools used to study aging in the Android mobile system are characterized. A general scheme for studying the aging phenomenon is described; it makes it possible to conduct experiments and determine the presence or absence of aging in a system, and it also indicates how various factors influence the manifestations of aging. The aging indicators in use are identified, namely system and application indicators such as Android Activity launch time, RAM, file storage, CPU usage, and the Garbage Collector. The main factors that influence the manifestations of aging are highlighted: the technical characteristics of the device, application types and program code, the intensity of application launches, input events, RAM, and file storage memory. It was established that, according to the results of previous studies, support vector machines and decision trees are effective machine learning algorithms for detecting aging. Existing studies, methods, and tools for performing software rejuvenation to reduce the impact of aging on the reliability of the Android system are analyzed. It was found that, to counteract software aging in the Android mobile system, tools are proposed both at the level of mobile application architecture design and implementation and at the system and component levels. It was established that the key means of counteracting aging is restarting components at the system level (for example, the Activity manager) or the application level (Java containers), and that tools for scheduling the rejuvenation procedure need to be developed.
The relevance of the impact of aging on ensuring the reliability of modern mobile and embedded systems is substantiated. Directions for future research are identified, namely: determining effective factors and indicators for mobile systems, building aging models, and developing methods and tools for software rejuvenation in mobile systems.

  • Research Article
  • Citations: 33
  • 10.1016/j.peva.2013.05.010
A comprehensive approach to optimal software rejuvenation
  • Jul 16, 2013
  • Performance Evaluation
  • Jing Zhao + 5 more
