Abstract

HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. To reliably meet the peak demands of the next generation of High Energy Physics experiments, Fermilab must plan to elastically expand its computational capabilities to cover the forecasted need. Commercial cloud and allocation-based High Performance Computing (HPC) resources both have explicit and implicit costs that must be considered when deciding when to provision these resources, and at what scale. To support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations. In this paper, we discuss the goals and architecture of the HEPCloud Facility, the architecture of the IDSS, and our early experience in using the IDSS for automated facility expansion at both Fermi National Accelerator Laboratory and Brookhaven National Laboratory.
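
As an illustration of the kind of cost-aware choice the IDSS automates, the sketch below greedily fills a demand for cores from a set of candidate resource classes while respecting a spending limit. The resource names, prices, latencies, and budget figures are hypothetical assumptions for illustration only; they are not the facility's actual business rules or cost model.

```python
# Illustrative sketch only: a toy cost-aware provisioning rule of the kind
# the IDSS automates. Resource names, prices, and the budget are hypothetical.
from dataclasses import dataclass

@dataclass
class ResourceOffer:
    name: str                   # e.g. an opportunistic grid pool or a cloud spot market
    cost_per_core_hour: float   # explicit cost in USD (0.0 for allocation/grid resources)
    startup_latency_s: int      # implicit cost: time before slots become usable
    max_cores: int

def choose_offers(offers, cores_needed, budget_usd, walltime_hours):
    """Greedily fill the demand with the cheapest offers that fit the budget."""
    plan, remaining = [], cores_needed
    for offer in sorted(offers, key=lambda o: o.cost_per_core_hour):
        if remaining <= 0:
            break
        cores = min(offer.max_cores, remaining)
        cost = cores * offer.cost_per_core_hour * walltime_hours
        if cost <= budget_usd:
            plan.append((offer.name, cores))
            budget_usd -= cost
            remaining -= cores
    return plan, remaining   # remaining > 0 means the demand cannot be met within budget

offers = [
    ResourceOffer("osg_opportunistic", 0.0, 1800, 5000),
    ResourceOffer("hpc_allocation", 0.0, 3600, 20000),
    ResourceOffer("cloud_spot", 0.02, 300, 100000),
]
print(choose_offers(offers, cores_needed=40000, budget_usd=5000.0, walltime_hours=8))
```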

Highlights

  • In this paper we describe the goals and high-level architecture of the HEPCloud facility, the architecture of the Decision Engine (DE), and our early experience in using the DE for automated facility expansion at Fermi National Accelerator Laboratory and Brookhaven National Laboratory.

  • The Fermilab scientific computing staff supplies software and services to support the physics program and provide essential resources for leading high energy physics (HEP) experiments, including US-CMS [5], NOvA [6], g-2 [7], and MicroBooNE [8], along with the future experiments DUNE and mu2e. These resources include several types of dedicated and shared resources (CPU, disk, and hierarchical storage, including disk cache, tape, and tape libraries) for both data-intensive and compute-intensive scientific work. Support for these resources is currently limited to resources provisioned by and hosted at Fermilab, or to remote resources made available through the Open Science Grid (OSG) [9].

  • HEPCloud intends to mitigate these problems by intelligently extending the current Fermilab compute facility to execute jobs submitted by scientists on a diverse set of resources, including commercial and community clouds, grid federations, and High Performance Computing (HPC) centers.

Summary

Introduction

Included in the DE is a software framework with stages for acquiring data, performing data analytics, and generating decisions using an inference engine. A knowledge base is used to manage all data made available within the running system. Careful attention is paid to system-wide configuration coherency, addressing the needs of all user groups. In this paper we describe the goals and high-level architecture of the HEPCloud facility, the architecture of the DE, and our early experience in using the DE for automated facility expansion at Fermi National Accelerator Laboratory and Brookhaven National Laboratory.
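
A minimal sketch of how such a staged framework can be wired together is shown below. The class and method names are illustrative assumptions and do not reproduce the Decision Engine's actual API; they only mirror the three stages described above (data acquisition, data analytics, and decision generation) operating on a shared knowledge base.

```python
# Minimal sketch of a staged decision cycle: acquire data, transform it,
# and derive provisioning decisions from a shared knowledge base.
# Class and method names are illustrative assumptions, not the DE's real API.

class KnowledgeBase(dict):
    """Holds all data products made available within a decision cycle."""

class Source:
    def acquire(self) -> dict:
        raise NotImplementedError

class IdleJobsSource(Source):
    def acquire(self):
        # A real source would query the batch system (e.g. HTCondor).
        return {"idle_jobs": 12000}

class Transform:
    def transform(self, kb: KnowledgeBase) -> dict:
        raise NotImplementedError

class DemandEstimator(Transform):
    def transform(self, kb):
        # Naive analytics step: assume one core per idle job.
        return {"cores_needed": kb["idle_jobs"]}

class InferenceEngine:
    def decide(self, kb: KnowledgeBase) -> list:
        # A real inference engine would evaluate facts against rules; this is a stub.
        if kb["cores_needed"] > 10000:
            return [{"action": "provision", "resource": "cloud_spot",
                     "cores": kb["cores_needed"] - 10000}]
        return []

def run_cycle(sources, transforms, engine):
    kb = KnowledgeBase()
    for s in sources:                 # stage 1: data acquisition
        kb.update(s.acquire())
    for t in transforms:              # stage 2: data analytics
        kb.update(t.transform(kb))
    return engine.decide(kb)          # stage 3: decision generation

print(run_cycle([IdleJobsSource()], [DemandEstimator()], InferenceEngine()))
```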

The HEPCloud Facility
Decision Engine
Decision Engine Architecture
Decision Channel
Knowledge Management System
Decision Cycle
Task Manager
Decision Engine with glideinWMS as the Resource Provisioner
Decision Engine with VC3 as the Resource Provisioner
Conclusion
