Abstract

Process checkpoint-restart is a technology with great potential for use in HEP workflows. Use cases include debugging, reducing the startup time of applications both in offline batch jobs and the High Level Trigger, permitting job preemption in environments where spare CPU cycles are being used opportunistically and efficient scheduling of a mix of multicore and single-threaded jobs. We report on tests of checkpoint-restart technology using CMS software, Geant4-MT (multi-threaded Geant4), and the DMTCP (Distributed Multithreaded Checkpointing) package. We analyze both single- and multi-threaded applications and test on both standard Intel x86 architectures and on Intel MIC. The tests with multi-threaded applications on Intel MIC are used to consider scalability and performance. These are considered an indicator of what the future may hold for many-core computing.

Highlights

  • The computing requirements for high energy physics (HEP) projects like the Large Hadron Collider (LHC) [1] at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland are larger than can be met with resources deployed in a single computing center

  • This has led to the construction of a global distributing computing system known as the Worldwide LHC Computing Grid (WLCG) [2], which brings together resources from nearly 160 computer centers in 35 countries

  • In this paper we examine the use of a transparent, user-level checkpointing package for distributed applications called Distributed MultiThreaded CheckPointing (DMTCP) [7]

Read more

Summary

Introduction

The computing requirements for high energy physics (HEP) projects like the Large Hadron Collider (LHC) [1] at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland are larger than can be met with resources deployed in a single computing center. In this case it is useful to be able to “preempt” running opportunistic jobs, checkpoint their state to disk and restart them when opportunistic use is again possible. One possibility for managing such situations would be to checkpoint the job with a single active thread and restart a number of such jobs at a later time together, to keep the full set of CPU cores active.

Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.