Abstract
Innovative scientific applications and emerging dense data sources are creating a data deluge for high-end supercomputing systems. Modern applications are often collaborative in nature, with a distributed user base for input and output data sets. Processing such large input data typically involves copying (or staging) the data onto the supercomputer's specialized high-speed scratch storage to sustain high I/O throughput. This copying is crucial, as remotely accessing the data while an application executes introduces unnecessary delays and consequent performance degradation. However, the current practice of conservatively staging data as early as possible leaves the data vulnerable to storage failures, which may entail restaging and reduce job throughput. To address this, we present a timely staging framework that uses a combination of job start-up time predictions, user-specified volunteer or cloud-based intermediate storage nodes, and decentralized data delivery to make input data staging coincide with job start-up. Evaluation of our approach using both PlanetLab and Azure cloud services, as well as simulations based on three years of job logs from the Jaguar supercomputer (No. 3 in the Top500 list), shows as much as a 91.0 percent reduction in staging times compared to direct transfers, a 75.2 percent reduction in wait time on scratch, and a 2.4 percent reduction in usage/hour. (An earlier version of this paper appears in [30].)
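The core idea of timely staging, deferring the input transfer so that it finishes just before the predicted job start rather than as early as possible, can be illustrated with a minimal sketch. The helper names below (predict_job_start, estimate_transfer_seconds, stage_data) and the numeric values are hypothetical placeholders, not the paper's actual interfaces or results.

```python
# Minimal sketch of timely staging: delay the input-data transfer so it
# completes shortly before the job's predicted start time, shrinking the
# window during which the data sits exposed on scratch.
# All helpers below are hypothetical placeholders for illustration only.

import time
from datetime import datetime, timedelta


def predict_job_start(job_id: str) -> datetime:
    """Hypothetical: query a wait-time predictor for the job's expected start."""
    return datetime.now() + timedelta(hours=2)


def estimate_transfer_seconds(dataset_size_bytes: int, bandwidth_bps: float) -> float:
    """Rough transfer-time estimate from dataset size and effective bandwidth."""
    return dataset_size_bytes / bandwidth_bps


def stage_data(dataset: str) -> None:
    """Hypothetical: pull the dataset from intermediate nodes onto scratch."""
    print(f"staging {dataset} to scratch...")


def schedule_timely_staging(job_id: str, dataset: str,
                            dataset_size_bytes: int, bandwidth_bps: float,
                            safety_margin_s: float = 600.0) -> None:
    """Start staging late enough to limit exposure on scratch, but early
    enough (transfer estimate plus a safety margin) to finish before start."""
    start = predict_job_start(job_id)
    transfer_s = estimate_transfer_seconds(dataset_size_bytes, bandwidth_bps)
    begin_at = start - timedelta(seconds=transfer_s + safety_margin_s)
    delay = (begin_at - datetime.now()).total_seconds()
    if delay > 0:
        time.sleep(delay)  # in practice, the prediction would be re-checked periodically
    stage_data(dataset)


if __name__ == "__main__":
    # e.g., a 1 TB input set over a path with ~1 GB/s effective bandwidth
    schedule_timely_staging("job-42", "input.h5", 10**12, 10**9)
```

In the actual framework, the data would additionally be held on volunteer or cloud-based intermediate nodes and delivered in a decentralized fashion, so a revised (later or earlier) start-time prediction only shifts the final hop onto scratch rather than forcing a full end-to-end restaging.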