FAILURE MANAGEMENT IN GRIDS: THE CASE OF THE EGEE INFRASTRUCTURE

Kyriakos Neocleous, Paraskevi Fragopoulou, Marios D Dikaiakos, Evangelos P Markatos

doi:10.1142/s0129626407003113

Open Access

https://doi.org/10.1142/s0129626407003113

Abstract

The emergence of Grid infrastructures like EGEE has enabled the deployment of large-scale computational experiments that address challenging scientific problems in various fields. To realize their full potential, however, Grid infrastructures need to achieve a higher degree of dependability, i.e., they need to improve the ratio of Grid-job requests that complete successfully in the presence of Grid-component failures. Achieving this first requires determining, analyzing, and classifying the causes of job failures on Grids. In this paper we study the reasons behind Grid job failures in the context of EGEE, the largest Grid infrastructure currently in operation. We present points of failure in a Grid that affect the execution of jobs, and describe error types and contributing factors. We discuss various information sources that provide users and administrators with indications about failures, and assess their usefulness based on the accuracy and completeness of the error information they provide. We present two real-life case studies of failures that occurred on a production site of EGEE, together with the troubleshooting process for each case. Finally, we propose the architecture for a system that could provide failure management support to administrators and end-users of large-scale Grid infrastructures like EGEE.

Journal: Parallel Processing Letters
Publication Date: Dec 1, 2007
Citations: 14

Similar Papers

Searching for Software on the EGEE Infrastructure
George Pallis ... Asterios Katsifodimos
Journal of Grid Computing | VOL. 8
23 Mar 2010

Timely Rendering Algorithm of Virtualization System Based on CUDA for Smart Scenarios of Power Grid Infrastructure
Chunli Wang ... Yan Yan
29 Jan 2023

Data Management in Production Grids - Challenges and Techniques
E Laure
03 Jul 2006

Improvements of common open Grid standards to increase High Throughput and High Performance Computing effectiveness on large-scale Grid and e-science infrastructures
M Riedel ... A Streit
01 Apr 2010
