Proactive Failure Management Research Articles

Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remain challenging research problems. Traditional fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective, proactive failure management approach aimed at minimizing the effect of failure within the system. With the advent of machine learning techniques, the ability to learn from past information to predict future patterns of behaviour makes it possible to predict potential system failures more accurately. In this paper, we therefore explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model achieves an average failure prediction accuracy of 90% using SVM, which is more accurate and effective than the other algorithms. This finding implies that our method can effectively predict all possible future system and application failures within the system.
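As a rough illustration of the general idea in this abstract (time-series windows fed to a classifier and scored by prediction accuracy), the sketch below labels each sliding window by whether a failure follows within a lead time, then evaluates a from-scratch k-nearest-neighbours classifier on held-out windows. It is not the paper's model: the trace is synthetic, and the names `make_windows`, `knn_predict`, `accuracy` and `synthetic_trace`, as well as the window width and lead time, are assumptions made for illustration. KNN is used only because it is one of the algorithms the abstract lists and is short to write without a library.

```python
from collections import Counter
from math import dist


def make_windows(series, failures, width, lead):
    """Turn a metric series into (window, label) pairs.

    The label is 1 if a failure occurs within `lead` steps after the
    window ends, which is what gives the prediction its lead time.
    """
    samples = []
    for end in range(width, len(series) - lead + 1):
        window = tuple(series[end - width:end])
        label = int(any(failures[end:end + lead]))
        samples.append((window, label))
    return samples


def knn_predict(train, window, k=3):
    """Majority vote among the k training windows closest to `window`."""
    nearest = sorted(train, key=lambda sample: dist(sample[0], window))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out windows whose label is predicted correctly."""
    hits = sum(knn_predict(train, window, k) == label for window, label in test)
    return hits / len(test)


def synthetic_trace(steps=200, period=20):
    """Hypothetical trace: load ramps up over the last steps of each period,
    and a failure is recorded on the final step of the period."""
    series, failures = [], []
    for t in range(steps):
        phase = t % period
        series.append(0.2 if phase < 14 else 0.2 + 0.15 * (phase - 13))
        failures.append(1 if phase == period - 1 else 0)
    return series, failures


series, failures = synthetic_trace()
samples = make_windows(series, failures, width=5, lead=3)

# Split in time order so the classifier is tested only on later windows.
split = int(0.7 * len(samples))
train, test = samples[:split], samples[split:]
print(f"KNN accuracy on held-out windows: {accuracy(train, test):.2f}")
```

The synthetic trace is perfectly periodic, so the score here says nothing about real systems. Comparing the other listed algorithms (SVM, RF, CART, LDA) would in practice rely on a machine learning library and real failure logs.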

In large-scale networked computing systems, component failures become the norm rather than the exception. Failure-aware resource management is crucial for enhancing system availability and achieving high performance. In this paper, we study how to efficiently utilize system resources for high-availability computing with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for networked computing systems. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques in calculating nodes’ reliability states. We consider both the performance and reliability status of compute nodes in making selection decisions. We define a capacity–reliability metric to combine the effects of both factors in node selection, and propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best-qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from production systems and the NAS Parallel Benchmark programs on a real-world cluster system. The results show that the proposed strategies enhance system productivity with a practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.
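The abstract does not give the capacity–reliability metric or the Best-fit algorithms, so the following is only a minimal sketch of the broader idea of ranking nodes by a combined score. Everything in it is an assumption: the weighted-sum formula, the `alpha` weight, the `margin` parameter, the `Node` fields, and the reading of "optimistic" as taking the predicted time to failure at face value and "pessimistic" as requiring extra headroom. It performs a simple top-k ranking and is not the paper's Best-fit procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float          # available compute capacity, normalised to [0, 1]
    time_to_failure: float   # predicted hours until next failure (hypothetical predictor output)


def reliability(node, job_hours, margin=0.0):
    """Assumed reliability term: 1.0 if the node is predicted to outlive
    the job (plus a safety margin), otherwise the fraction it covers."""
    needed = job_hours * (1.0 + margin)
    return min(1.0, node.time_to_failure / needed)


def score(node, job_hours, alpha=0.5, margin=0.0):
    """Hypothetical combined metric: weighted sum of capacity and reliability."""
    return alpha * node.capacity + (1.0 - alpha) * reliability(node, job_hours, margin)


def select_nodes(nodes, count, job_hours, pessimistic=False, alpha=0.5):
    """Return the `count` highest-scoring nodes.

    The pessimistic variant (an assumption) demands 50% more predicted
    lifetime than the job needs before treating a node as fully reliable.
    """
    margin = 0.5 if pessimistic else 0.0
    ranked = sorted(nodes, key=lambda n: score(n, job_hours, alpha, margin), reverse=True)
    return ranked[:count]


nodes = [
    Node("A", capacity=0.9, time_to_failure=10.0),
    Node("B", capacity=0.6, time_to_failure=100.0),
    Node("C", capacity=0.8, time_to_failure=26.0),
]

for label, flag in (("optimistic", False), ("pessimistic", True)):
    chosen = select_nodes(nodes, count=2, job_hours=24.0, pessimistic=flag)
    print(label, [n.name for n in chosen])
```

With these made-up numbers, node C just outlives a 24-hour job, so the optimistic ranking prefers it, while the pessimistic ranking discounts it and prefers the longer-lived node B. The point is only that the two variants can order the same nodes differently.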

Articles published on Proactive Failure Management

A Combined System Metrics Approach to Cloud Service Reliability Using Artificial Intelligence

Failure prediction using machine learning in a virtualised HPC system and application

Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems

Quantifying event correlations for proactive failure management in networked computing systems

Failure-aware resource management for high-availability computing clusters with distributed virtual machines
