Use of checkpoint-restart for complex HEP software on traditional architectures and Intel MIC

Kapil Arya, Andrea Dotti, Gene Cooperman, Peter Elmer

doi:10.1088/1742-6596/523/1/012015

Abstract

Process checkpoint-restart is a technology with great potential for use in HEP workflows. Use cases include debugging, reducing the startup time of applications both in offline batch jobs and in the High Level Trigger, permitting job preemption in environments where spare CPU cycles are being used opportunistically, and efficient scheduling of a mix of multicore and single-threaded jobs. We report on tests of checkpoint-restart technology using CMS software, Geant4-MT (multi-threaded Geant4), and the DMTCP (Distributed Multithreaded Checkpointing) package. We analyze both single- and multi-threaded applications and test on both standard Intel x86 architectures and on Intel MIC. The tests with multi-threaded applications on Intel MIC are used to consider scalability and performance. These are considered an indicator of what the future may hold for many-core computing.

Highlights

The computing requirements for high energy physics (HEP) projects like the Large Hadron Collider (LHC) [1] at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland are larger than can be met with resources deployed in a single computing center.
This has led to the construction of a global distributed computing system known as the Worldwide LHC Computing Grid (WLCG) [2], which brings together resources from nearly 160 computer centers in 35 countries.
In this paper we examine the use of a transparent, user-level checkpointing package for distributed applications called Distributed MultiThreaded CheckPointing (DMTCP) [7].

Summary

Introduction

The computing requirements for high energy physics (HEP) projects like the Large Hadron Collider (LHC) [1] at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland are larger than can be met with resources deployed in a single computing center. Where spare CPU cycles are used opportunistically, it is useful to be able to “preempt” running opportunistic jobs, checkpoint their state to disk, and restart them when opportunistic use is again possible. One possibility for managing such situations would be to checkpoint the job with a single active thread and later restart a number of such jobs together, to keep the full set of CPU cores active.
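The idea of restarting several single-threaded checkpoints together can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheduling used in the paper: the helper names `restart_command` and `plan_restarts` and the image file names are hypothetical, and the sketch assumes each checkpoint image is restarted with DMTCP's `dmtcp_restart` tool. The commands are built as argument lists but never executed.

```python
from typing import List


def restart_command(image: str) -> List[str]:
    """Build the argument list that would restart one checkpoint image.

    `dmtcp_restart` is DMTCP's restart tool; the image path is hypothetical.
    The command is only constructed here, never executed.
    """
    return ["dmtcp_restart", image]


def plan_restarts(images: List[str], free_cores: int) -> List[List[str]]:
    """Group single-threaded checkpoint images into batches of `free_cores`.

    Each batch is meant to be restarted together so that every core is busy.
    The last batch may be smaller when the images do not divide evenly.
    """
    if free_cores < 1:
        raise ValueError("free_cores must be at least 1")
    return [images[i:i + free_cores] for i in range(0, len(images), free_cores)]


if __name__ == "__main__":
    # Hypothetical image names, for illustration only.
    images = [f"ckpt_job{n}.dmtcp" for n in range(5)]
    for batch in plan_restarts(images, free_cores=4):
        print([restart_command(image) for image in batch])
```

A real batch system would also have to decide when to preempt, where checkpoint images are stored, and how each restarted job is attached to a coordinator; none of that is modeled here.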

Journal: Journal of Physics: Conference Series
Publication Date: Jun 6, 2014
Citations: 3
License type: cc-by
