Abstract

HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. To reliably meet the peak demands of the next generation of High Energy Physics experiments, Fermilab must plan to elastically expand its computational capabilities to cover the forecasted need. Commercial cloud and allocation-based High Performance Computing (HPC) resources both have explicit and implicit costs that must be considered when deciding when to provision these resources, and at what scale. To support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations. In this paper, we discuss the goals and architecture of the HEPCloud Facility, the architecture of the IDSS, and our early experience in using the IDSS for automated facility expansion at both Fermi National Accelerator Laboratory and Brookhaven National Laboratory.
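
As an illustration of the kind of cost-aware choice the IDSS automates, the sketch below greedily fills a demand for cores from a set of candidate resource classes while respecting a spending limit. The resource names, prices, latencies, and budget figures are hypothetical assumptions for illustration only; they are not the facility's actual business rules or cost model.

```python
# Illustrative sketch only: a toy cost-aware provisioning rule of the kind
# the IDSS automates. Resource names, prices, and the budget are hypothetical.
from dataclasses import dataclass

@dataclass
class ResourceOffer:
    name: str                   # e.g. an opportunistic grid pool or a cloud spot market
    cost_per_core_hour: float   # explicit cost in USD (0.0 for allocation/grid resources)
    startup_latency_s: int      # implicit cost: time before slots become usable
    max_cores: int

def choose_offers(offers, cores_needed, budget_usd, walltime_hours):
    """Greedily fill the demand with the cheapest offers that fit the budget."""
    plan, remaining = [], cores_needed
    for offer in sorted(offers, key=lambda o: o.cost_per_core_hour):
        if remaining <= 0:
            break
        cores = min(offer.max_cores, remaining)
        cost = cores * offer.cost_per_core_hour * walltime_hours
        if cost <= budget_usd:
            plan.append((offer.name, cores))
            budget_usd -= cost
            remaining -= cores
    return plan, remaining   # remaining > 0 means the demand cannot be met within budget

offers = [
    ResourceOffer("osg_opportunistic", 0.0, 1800, 5000),
    ResourceOffer("hpc_allocation", 0.0, 3600, 20000),
    ResourceOffer("cloud_spot", 0.02, 300, 100000),
]
print(choose_offers(offers, cores_needed=40000, budget_usd=5000.0, walltime_hours=8))
```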

Highlights

  • In this paper we describe the goals and high-level architecture of the HEPCloud facility, the architecture of the Decision Engine (DE), and our early experience in using the DE for automated facility expansion at Fermi National Accelerator Laboratory and Brookhaven National Laboratory.

  • The Fermilab scientific computing staff supplies software and services to support the physics program and provide essential resources for leading high energy physics (HEP) experiments, including US-CMS [5], NOvA [6], g-2 [7], and MicroBooNE [8], along with the future experiments DUNE and mu2e. These resources include several types of dedicated and shared resources (CPU, disk, and hierarchical storage, including disk cache, tape, and tape libraries) for both data-intensive and compute-intensive scientific work. Support for these resources is currently limited to resources provisioned by and hosted at Fermilab, or to remote resources made available through the Open Science Grid (OSG) [9].

  • HEPCloud intends to mitigate these problems by intelligently extending the current Fermilab compute facility to execute jobs submitted by scientists on a diverse set of resources, including commercial and community clouds, grid federations, and High Performance Computing (HPC) centers.

Summary

Introduction

Included in the DE is a software framework with stages for acquiring data, performing data analytics, and generating decisions using an inference engine. A knowledge base is used to manage all data made available within the running system. Careful attention is paid to system-wide configuration coherency, addressing the needs of all user groups. In this paper we describe the goals and high-level architecture of the HEPCloud facility, the architecture of the DE, and our early experience in using the DE for automated facility expansion at Fermi National Accelerator Laboratory and Brookhaven National Laboratory.
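
A minimal sketch of how such a staged framework can be wired together is shown below. The class and method names are illustrative assumptions and do not reproduce the Decision Engine's actual API; they only mirror the three stages described above (data acquisition, data analytics, and decision generation) operating on a shared knowledge base.

```python
# Minimal sketch of a staged decision cycle: acquire data, transform it,
# and derive provisioning decisions from a shared knowledge base.
# Class and method names are illustrative assumptions, not the DE's real API.

class KnowledgeBase(dict):
    """Holds all data products made available within a decision cycle."""

class Source:
    def acquire(self) -> dict:
        raise NotImplementedError

class IdleJobsSource(Source):
    def acquire(self):
        # A real source would query the batch system (e.g. HTCondor).
        return {"idle_jobs": 12000}

class Transform:
    def transform(self, kb: KnowledgeBase) -> dict:
        raise NotImplementedError

class DemandEstimator(Transform):
    def transform(self, kb):
        # Naive analytics step: assume one core per idle job.
        return {"cores_needed": kb["idle_jobs"]}

class InferenceEngine:
    def decide(self, kb: KnowledgeBase) -> list:
        # A real inference engine would evaluate facts against rules; this is a stub.
        if kb["cores_needed"] > 10000:
            return [{"action": "provision", "resource": "cloud_spot",
                     "cores": kb["cores_needed"] - 10000}]
        return []

def run_cycle(sources, transforms, engine):
    kb = KnowledgeBase()
    for s in sources:                 # stage 1: data acquisition
        kb.update(s.acquire())
    for t in transforms:              # stage 2: data analytics
        kb.update(t.transform(kb))
    return engine.decide(kb)          # stage 3: decision generation

print(run_cycle([IdleJobsSource()], [DemandEstimator()], InferenceEngine()))
```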

The HEPCloud Facility
Decision Engine
Decision Engine Architecture
Decision Channel
Knowledge Management System
Decision Cycle
Task Manager
Decision Engine with glideinWMS as the Resource Provisioner
Decision Engine with VC3 as the Resource Provisioner
Conclusion
